
Showing top 82 results for "GPU needs for LLMs"

Advancing Emerging Optimizers for Accelerated LLM Training with NVIDIA Megatron | NVIDIA Technical Blog

… Because whole layers vary in size, each GPU needs to collect differently sized parameter updates from different GPUs through all gatherv. …

Apr 22, 2026 · Hao Wu

Maximize AI Infrastructure Throughput by Consolidating Underutilized GPU Workloads | NVIDIA Technical Blog

… A pod requests nvidia.com/gpu: 1, and the scheduler binds it to a physical device. Large language models (LLMs) like NVIDIA Nemotron, Llama 3, or Qwen 7B/8B require dedicated compute to maintain low time to first token (TTFT) and high batch throughput. …

Mar 25, 2026 · Sagar Desai

Removing the Guesswork from Disaggregated Serving | NVIDIA Technical Blog

… From a user’s perspective, comparing backends is a one-flag change. TensorRT LLM: aiconfigurator cli default --model-path nvidia/Qwen3-32B-NVFP4 --total-gpus 64 --system b200 sxm --backend trtllm. SGLang: aiconfigurator cli default --model-path nvidia/Qwen3-32B-NVFP4 --total-gpus 64 --system… …

Mar 9, 2026 · Tianhao Xu

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare | NVIDIA Technical Blog

… The job needs 60 GPUs: their 20-GPU quota plus 40 from the over-quota pool. …

Jan 28, 2026 · Ekin Karabulut

Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM | NVIDIA Technical Blog

… The GPU spends more time moving KV cache data than computing. …

Dec 16, 2025 · Laikh Tewari

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog

… Primary metrics include: TTFT: latency from request submission to first response token; Output throughput: tokens generated per second per session; GPU utilization: percentage of GPU memory consumed under load; Concurrency scaling: maximum simultaneous users supported while maintaining TTFT and throug… …

Feb 18, 2026 · Boskey Savla

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical Blog

… GPUs can train when inference load is low, automatically yielding resources when user-facing requests arrive. GPU fractions with bin packing for multiple small models on a GPU: Many NIM workloads, like embeddings, rerankers, and small LLMs, rarely need an entire GPU. …

Feb 27, 2026 · Shwetha Krishnamurthy

Post-Training Quantization of LLMs with NVIDIA NeMo and NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

… All the quantized variants of the Llama 3 70B model can be served using only one NVIDIA H100 GPU while the baseline FP16 precision requires at least two GPUs. …

Sep 10, 2024 · Jan Lasek
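
The one-GPU versus two-GPU claim in the snippet above can be sanity-checked with rough weight-memory arithmetic. The sketch below is an illustration, not the article's method: it assumes an 80 GB H100, a round 70 billion parameters, and counts weights only (no KV cache, activations, or runtime overhead). The function names and the list of formats are hypothetical.

```python
import math

# Assumption: the 80 GB H100 variant; other capacities would change the result.
H100_MEMORY_GB = 80


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9


def min_gpus_for_weights(num_params: float, bits_per_param: int,
                         gpu_memory_gb: float = H100_MEMORY_GB) -> int:
    """Smallest GPU count whose combined memory holds the weights.

    This is a lower bound: real deployments also need room for the
    KV cache, activations, and framework overhead.
    """
    return math.ceil(weight_memory_gb(num_params, bits_per_param) / gpu_memory_gb)


if __name__ == "__main__":
    params = 70e9  # Llama 3 70B, rounded
    # Illustrative precisions; the article's exact quantized variants may differ.
    for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
        gb = weight_memory_gb(params, bits)
        gpus = min_gpus_for_weights(params, bits)
        print(f"{name}: ~{gb:.0f} GB of weights, at least {gpus} GPU(s)")
```

Under these assumptions FP16 weights come to about 140 GB, which exceeds a single 80 GB device, while 8-bit and 4-bit weights come to about 70 GB and 35 GB and fit on one.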

Deploying Disaggregated LLM Inference Workloads on Kubernetes | NVIDIA Technical Blog

… You want to maximize your GPUs for high throughput and can parallelize aggressively. Decode workers generate output tokens one at a time. This is memory-bandwidth-bound because of the autoregressive nature of LLMs. You want GPUs with fast high-bandwidth memory (HBM) access. …

Mar 23, 2026 · Anish Maddipoti

LLM Inference Benchmarking: How Much Does Your LLM Inference Cost? | NVIDIA Technical Blog

… He has contributed to production applications of LLMs covering RAG systems, optimization of inference servers, pretraining of LLMs from scratch, custom evaluation of LLMs, or quantization using FP8 formats. …

Jun 18, 2025 · Vinh Nguyen
