Search: Kernel scheduling improvements

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models | NVIDIA Technical Blog

… How disaggregated serving removes the critical path and boosts throughput 1.5x. Despite kernel and scheduling improvements, our profiling indicated that inter-GPU communication for token distribution (expert parallelism) remained on the critical path. …

Feb 18, 2026 · Utkarsh Uppal

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight | NVIDIA Technical Blog

… Each kernel launch has several associated overheads, like scheduling and kernel resource management. In this setting, constant per-kernel overhead and little work per kernel lead to an unfavorable ratio between overhead and actual work. …

Apr 2, 2026 · Andreas Kieslinger
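
The overhead-to-work ratio described in the snippet above can be made concrete with a small back-of-the-envelope model. This is only a sketch: the 5 µs per-launch overhead and the 1000 µs of total work are hypothetical numbers chosen for illustration, not measurements from the article, and `overhead_fraction` is an illustrative helper rather than any NVIDIA API.

```python
def overhead_fraction(num_kernels, launch_overhead_us, total_work_us):
    """Fraction of wall time spent on constant per-launch overhead.

    Assumes launches do not overlap and each launch pays the same
    fixed cost (a simplification of scheduling and resource management).
    """
    overhead_us = num_kernels * launch_overhead_us
    return overhead_us / (overhead_us + total_work_us)


# Hypothetical numbers: the same 1000 us of work, split two ways.
many_small = overhead_fraction(num_kernels=100, launch_overhead_us=5.0,
                               total_work_us=1000.0)
one_batched = overhead_fraction(num_kernels=1, launch_overhead_us=5.0,
                                total_work_us=1000.0)

print(f"100 small kernels: {many_small:.1%} of time is overhead")
print(f"1 batched kernel:  {one_batched:.1%} of time is overhead")
```

Under these assumed numbers, splitting the work across 100 launches spends about a third of the time on overhead, while a single batched launch spends well under one percent.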

Extract More Kernel Performance with NVIDIA CompileIQ Auto-Tuning | NVIDIA Technical Blog

… NVIDIA GPU compilers apply the same default heuristics (register allocation strategies, instruction scheduling decisions, loop unrolling thresholds, etc.) to every kernel they compile. …

May 26, 2026 · Aditya Srikanth

MiniMax M2.7 Advances Scalable Agentic Workflows on NVIDIA Platforms for Complex AI Applications | NVIDIA Technical Blog

… The kernel can better overlap computation and communication, reducing kernel launch and memory read/write overhead, and improving inference performance. FP8 MoE: Integration of NVIDIA TensorRT-LLM FP8 MoE modular kernel. …

Apr 12, 2026 · Anu Srivastava

NVIDIA CUDA 13.3 Enhances GPU Development with Tile Programming in C++, Compiler Autotuning, and Python Updates | NVIDIA Technical Blog

… The following are quick examples of how to use cuda.core APIs.
    from cuda.core import Device, Stream, Program, ProgramOptions, LaunchConfig, launch
    # pick and activate a GPU
    dev = Device()
    dev.set_current()
    # create a CUDA stream
    stream = dev.create_stream()
    # NVRTC compile + lookup
    prog = Program(src, code_typ… …

May 26, 2026 · Jonathan Bentz

Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core | NVIDIA Technical Blog

… To minimize invasive changes to the existing scheduling logic, a lightweight data iterator wrapper around the original data iterator is introduced. It performs three steps: Rescheduling and packing sequences in the global batch to create a balanced workload across DP ranks. …

Jan 28, 2026 · Kunlun Li
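
The rebalancing step mentioned in the snippet above can be illustrated with a generic greedy heuristic. This is a minimal sketch and not the Megatron Core implementation: it assumes workload is proportional to token count (real attention cost grows faster than linearly with sequence length), and `balance_across_ranks` is a hypothetical helper name.

```python
import heapq


def balance_across_ranks(seq_lengths, num_ranks):
    """Greedy longest-first assignment of sequences to DP ranks.

    Each sequence goes to the rank with the smallest load so far.
    Load is approximated by token count (an assumption of this sketch).
    """
    # Min-heap of (current_load, rank_index).
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lengths, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(length)
        heapq.heappush(heap, (load + length, rank))
    return assignment


# Hypothetical global batch of variable-length sequences.
lengths = [8192, 4096, 4096, 2048, 1024, 1024, 512, 512]
per_rank = balance_across_ranks(lengths, num_ranks=2)
loads = [sum(seqs) for seqs in per_rank]
print(per_rank, loads)
```

For this particular batch the two ranks end up with equal token counts; in general the greedy heuristic only approximates a balanced split.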

Advancing GPU Programming with the CUDA Tile IR Backend for OpenAI Triton | NVIDIA Technical Blog

Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes | NVIDIA Technical Blog

Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel | NVIDIA Technical Blog
