Search: kernel hardware requirements

Scaling Autonomous AI Agents and Workloads with NVIDIA DGX Spark | NVIDIA Technical Blog

… The kernel considered is the attention decode kernel, optimized using Tile IR. Performance scaling and optimization headroom: In Figure 1, the vertical positioning of the data points on the y-axis confirms that the kernel achieves higher hardware utilization on NVIDIA B200. …

Mar 16, 2026 · Allen Bourgoyne

How NVIDIA Extreme Hardware-Software Co-Design Delivered a Large Inference Boost for Sarvam AI’s Sovereign Models | NVIDIA Technical Blog

… He takes care of model optimization across target hardware and also maintains accelerator infrastructure at Sarvam. He likes to dive deep into model architecture, kernels, and hardware. …

Feb 18, 2026 · Utkarsh Uppal

cuTile.jl Brings NVIDIA CUDA Tile-Based Programming to Julia | NVIDIA Technical Blog

… The package maintains close syntax and abstraction parity with the cuTile Python version, making it easy to port code and leverage Python documentation, while using Julia-specific features like 1-based indexing and broadcasting. cuTile.jl achieves near-identical performance to the Python implementa… …

Mar 3, 2026 · Tim Besard

Achieving Single-Digit Microsecond Latency Inference for Capital Markets | NVIDIA Technical Blog

… Inference was implemented using a persistent kernel approach, meaning the kernel remains active throughout the application’s lifetime. This persistence improves performance by loading weights into shared memory and registers only once during kernel initialization. …

Apr 2, 2026 · Nikolay Markovskiy

Develop High-Performance GPU Kernels in C++ with NVIDIA CUDA Tile | NVIDIA Technical Blog

… By Jonathan Bentz and Tony Scudiero. NVIDIA CUDA Tile C++ enables tile-based GPU kernel programming within existing C++ codebase… …

May 26, 2026 · Jonathan Bentz

Validate Kubernetes for GPU Infrastructure with Layered, Reproducible Recipes | NVIDIA Technical Blog

… Prior to deploying anything, a readiness check compares recipe constraints against your snapshot: Kubernetes version, OS, kernel, and GPU hardware.
aicr validate \
  --recipe recipe.yaml \
  --phase readiness
After deployment, subsequent phases validate component health and conformance. …

Mar 12, 2026 · Mark Chmarny

Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight | NVIDIA Technical Blog

… In this setting, constant per-kernel overhead and little work per kernel lead to an unfavorable ratio between overhead and actual work. Changing this requires altering the paradigm from many small kernels to a few larger kernels. …

Apr 2, 2026 · Andreas Kieslinger

CUDA Tile Programming Now Available for BASIC! | NVIDIA Technical Blog

… Get set up: First, install cuTile BASIC with pip:
pip install git+https://github.com/nvidia/cuda-tile.git@basic-experimental
The full hardware and software requirements for running cuTile BASIC are listed at the end of this post (64k of RAM or more recommended). cuTile BASIC example: If you’ve learned … …

Apr 1, 2026 · Rob Armstrong

Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute | NVIDIA Technical Blog

… Using cuda.compute we can solve this using pure Python by calling a device-wide primitive.
import cuda.compute
from cuda.compute import OpKind
# Build-time tensors used to specialize the callable
build_A = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_B = torch.empty(2, 2, dtype=torch.fl… …

Feb 18, 2026 · Daniel Rodriguez

Maximizing Memory Efficiency to Run Bigger Models on NVIDIA Jetson | NVIDIA Technical Blog

… Analyze and measure hardware memory usage: In addition to CPU memory, GPU and multimedia allocations can impact available memory.
$ sudo cat /sys/kernel/debug/nvmap/iovmm/clients
This shows memory usage across processes using NvMap (e.g., CUDA, video pipelines). …

Apr 20, 2026 · Anshuman Bhat
