AR / VR – NVIDIA Technical Blog
…12 MIN READ May 21, 2026 Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters Maximizing the value of AI infrastructure demands deep visibility into GPU utilization. Yet many platform teams…
The operational payoff of running Slurm on Kubernetes comes from the ecosystem. Rather than building and maintaining separate toolchains for GPU management, monitoring, networking, and node lifecycle, you can use the Kubernetes tooling that already exists for these problems. Platform teams manage clusters with declarative YAML, Helm deployments, rolling updates, and Prometheus or Grafana for observability.
Running Large-Scale GPU Workloads on Kubernetes with Slurm | NVIDIA Technical Blog
Slinky slurm-operator represents each Slurm component (slurmctld for scheduling, slurmdbd for accounting, slurmd for compute workers, slurmrestd for API access) as a Kubernetes Custom Resource Definition (CRD). A Slurm cluster is defined using Custom Resources, and Slinky creates containerized Slurm daemons running in their own pods, configured to belong to their respective cluster. Slinky ensures high availability (HA) of the Slurm control plane (slurmctld) through pod regeneration, with no need for the Slurm native HA mechanism. Configuration changes propagate automatically: Kubernetes synch…
Running Large-Scale GPU Workloads on Kubernetes with Slurm | NVIDIA Technical Blog
The GPU Usage Monitor is an open-source project that deploys a fully integrated GPU observability stack for Kubernetes. Rather than requiring SRE and platform teams to assemble and configure individual components, the GPU Usage Monitor bundles DCGM Exporter, kube-state-metrics, Prometheus, and Grafana into a single deployment, complete with pre-built dashboards designed specifically for GPU-accelerated workloads. The design principle is operational simplicity. A single helm install command results in actionable GPU visibility within minutes, with no custom dashboard authoring or scrape configurat…
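As a rough illustration of what such a stack exposes once DCGM Exporter metrics reach Prometheus, the sketch below builds an instant query against the Prometheus HTTP API and parses the JSON result. It is not taken from the GPU Usage Monitor project: the Prometheus address and the `Hostname` grouping label are assumptions, and only the `DCGM_FI_DEV_GPU_UTIL` metric name and the `/api/v1/query` endpoint are standard.

```python
from urllib.parse import urlencode

# Assumption: a hypothetical Prometheus address; adjust for your cluster.
PROMETHEUS_URL = "http://localhost:9090"


def gpu_util_query(group_by: str = "Hostname") -> str:
    """Return a PromQL expression averaging GPU utilization per label value."""
    # DCGM_FI_DEV_GPU_UTIL is the DCGM Exporter gauge for GPU utilization (percent).
    # The grouping label is an assumption; label sets vary by deployment.
    return f"avg by ({group_by}) (DCGM_FI_DEV_GPU_UTIL)"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response: dict, label: str) -> dict:
    """Extract {label value: float} from an instant-query JSON response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = {}
    for sample in response["data"]["result"]:
        # Each sample looks like {"metric": {labels}, "value": [timestamp, "value"]}.
        result[sample["metric"].get(label, "unknown")] = float(sample["value"][1])
    return result
```

Fetching the URL with any HTTP client and passing the decoded JSON to `parse_vector` yields a per-node utilization map, which is the kind of figure a pre-built dashboard panel would plot.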
Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA Technical Blog
NVSentinel is installed in each Kubernetes cluster. Once deployed, NVSentinel continuously watches nodes for errors, analyzes events, and takes automated actions such as quarantining, draining, labeling, or triggering external remediation workflows. Specific NVSentinel features include continuous monitoring, data aggregation and analysis, and more, as detailed below.
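The watch, analyze, act loop described here can be sketched in a few lines. This is not NVSentinel's code or API: the severity names, the policy table, and the `example.com/gpu-health` label key are all hypothetical. The only Kubernetes facts used are that a node is cordoned by setting `spec.unschedulable` to true and that labels live under `metadata.labels`.

```python
from enum import Enum


class Action(Enum):
    LABEL = "label"            # annotate the node only
    QUARANTINE = "quarantine"  # cordon: stop scheduling new pods
    DRAIN = "drain"            # cordon, then evict running pods


# Hypothetical severity-to-action policy, for illustration only.
POLICY = {
    "warning": Action.LABEL,
    "degraded": Action.QUARANTINE,
    "fatal": Action.DRAIN,
}


def choose_action(severity: str) -> Action:
    """Map an event severity to an action, defaulting to the least disruptive."""
    return POLICY.get(severity, Action.LABEL)


def node_patch(action: Action, reason: str) -> dict:
    """Build a merge-patch body for a Kubernetes Node object."""
    # The label key is a made-up example, not one NVSentinel defines.
    patch = {"metadata": {"labels": {"example.com/gpu-health": reason}}}
    if action in (Action.QUARANTINE, Action.DRAIN):
        # Cordoning a node means marking it unschedulable.
        patch["spec"] = {"unschedulable": True}
    # Draining additionally requires evicting pods, which is not shown here.
    return patch
```

Keeping the policy as data separates the decision (which action fits which event) from the mechanism (how the node is patched), so either can change without touching the other.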
Automate Kubernetes AI Cluster Health with NVSentinel | NVIDIA Technical Blog
…For enterprise Kubernetes deployments, the SDK documentation includes an NGINX Ingress configuration that supports multiple CloudXR servers with load balancing. Ensure your firewall allows TCP port 49100 (signaling), UDP port 47998 (media…
…NVIDIA NVSentinel provides Kubernetes-native GPU fault detection and automated remediation, cordoning unhealthy compute nodes and draining workloads in seconds rather than minutes or hours. NVIDIA Fleet Intelligence provides fleet-wide visibility…
…Integrating AIConfigurator in the AI Serving Stack for automated deployments The AI Serving Stack, built on the Alibaba Container Service for Kubernetes (ACK), is an end-to-end solution for efficient and…
…NVIDIA Run:ai’s dynamic GPU fractions solve this by replacing fixed allocations with a request/limit model, borrowing Kubernetes resource semantics for GPU memory: Request: The guaranteed minimum fraction, always reserved…
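A minimal sketch of the request/limit idea applied to one GPU's memory follows. It is not Run:ai's scheduler: the `GpuFraction` type, the `allocate` function, and the greedy first-come distribution of spare capacity are illustrative assumptions, and the reading of "limit" as an upper bound on bursting is borrowed from Kubernetes semantics rather than from the truncated text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuFraction:
    request: float  # guaranteed minimum fraction of GPU memory
    limit: float    # assumed upper bound the workload may burst to

    def __post_init__(self):
        if not 0.0 < self.request <= self.limit <= 1.0:
            raise ValueError("need 0 < request <= limit <= 1")


def allocate(workloads: dict[str, GpuFraction]) -> dict[str, float]:
    """Grant every request, then hand out spare memory up to each limit."""
    reserved = sum(w.request for w in workloads.values())
    if reserved > 1.0:
        raise ValueError("requests exceed one GPU")
    spare = 1.0 - reserved
    grants = {}
    for name, w in workloads.items():
        # Greedy, in insertion order: a simplification chosen for brevity.
        extra = min(w.limit - w.request, spare)
        grants[name] = w.request + extra
        spare -= extra
    return grants
```

For example, two workloads requesting 0.25 each, with limits of 0.75 and 0.5, leave 0.5 of the GPU spare. Under this greedy order the first workload bursts to 0.75 and the second stays at its guaranteed 0.25.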
…MCG runs on-premises or in your own cloud, with Kubernetes support to help you spin up on your own infrastructure. Performance results We ran the toolkit through standardized testing on public…
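…provides rendering and physics capabilities to applications, while ovstorage serves as the unified storage layer. Through a unified API layer, it connects PLM or existing repositories directly to the Omniverse ecosystem. This eliminates synchronization jobs and costly data migrations, enabling USD workflows without moving files. Designed for Kubernetes-ready headless deployments, ovstorage gives you control of the entire architecture so you can scale microservices independently and meet production demands without the constraints of a monolithic legacy stack. Getting started Integrate your existing infrastructure: Omniverse to your current storage backend (S3 or…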
…Other JetPack Components Cloud-Native Design Cloud-native design on Jetson helps you create scalable AI applications at the edge with containerized development, Kubernetes, and microservices, bridging cloud and edge development. Nsight…
…He led Cloud Solutions, Ethernet & InfiniBand SW management, storage, automation solutions, and upstream activities such as Ansible, Kubernetes, OpenStack, Puppet, Chef, and more. Slama holds a patent in the field of ML…