Showing top 83 results for "GPU needs for LLMs"

People also ask

Why is over-quota GPU resource fairness important?

Enterprise deployments have shown a consistent pattern: when organizations move from static GPU allocation to dynamic scheduling, cluster usage becomes far more dynamic. Over-quota resources (the shared pool beyond guaranteed quotas) become one of the most heavily utilized resource types. Teams regularly exceed their guaranteed allocations, resulting in higher GPU utilization and more compute time for researchers. This makes over-quota fairness critical. When a significant portion of cluster value comes from this shared pool, that pool needs to be divided fairly over time. The classical statel

Ensuring Balanced GPU Allocation in Kubernetes Clusters with Time-Based Fairshare | NVIDIA Technical Blog

Data Center / Cloud – NVIDIA Technical Blog

…You can optimize for specific GPU configurations and achieve... 9 MIN READ Jan 08, 2026 Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM Large language models…

May 12, 2026

Simulation / Modeling / Design – NVIDIA Technical Blog

May 12, 2026

Computer Vision / Video Analytics – NVIDIA Technical Blog

May 12, 2026

Agentic AI / Generative AI – NVIDIA Technical Blog

May 12, 2026

Automating Inference Optimizations with NVIDIA TensorRT LLM AutoDeploy | NVIDIA Technical Blog

…Keep PyTorch as the model definition while relying on TensorRT LLM for runtime integration. AutoDeploy currently supports more than 100 text‑to‑text LLMs and offers early support for VLMs and SSMs…

Feb 9, 2026 · Lucas Liebenwein

Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer | NVIDIA Technical Blog

…samples. The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup. torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py…

Oct 7, 2025 · Max Xu

AR / VR – NVIDIA Technical Blog

…Your Essential Tool for Measuring GPU Interconnect and Memory Performance When you’re writing CUDA applications, one of the most important things you need to focus on to write great code is…

May 22, 2026

Developer Tools & Techniques – NVIDIA Technical Blog developer.nvidia.com

Maximizing Memory Efficiency to Run Bigger Models on NVIDIA Jetson | NVIDIA Technical Blog

…SGLang enables flexible and programmable inference workflows. Llama.cpp and NVIDIA TensorRT Edge-LLM are optimized for memory-efficient execution in resource-constrained environments. These frameworks provide the infrastructure needed to serve…

Apr 20, 2026 · Anshuman Bhat

NVIDIA Nemotron AI Models

…NVIDIA TensorRT-LLM TensorRT™-LLM is an open-source library built to deliver high-performance, real-time inference optimization for large language models like Nemotron on NVIDIA GPUs. This open-source library…
