Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus | NVIDIA Technical Blog
… Observing changes on dashboards and correlating job-level degradations with underlying NCCL or network-layer metrics enables targeted triage based on where the anomaly originates. …
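
The correlation step described above can be sketched in a few lines of Python. This is a minimal illustration and not the method used by NCCL Inspector or the blog's dashboards: it assumes the job-level metric and the lower-layer metrics have already been exported as equally sampled time series (for example, from a Prometheus range query). The metric names `nccl_busbw_gbps` and `nic_rx_errors`, the helper `rank_suspects`, and all sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no signal to correlate against
    return cov / sqrt(vx * vy)


def rank_suspects(job_series, layer_series):
    """Rank lower-layer metrics by |correlation| with the job-level metric."""
    scores = {name: pearson(job_series, s) for name, s in layer_series.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Made-up samples: training step time (seconds) degrades in the second half.
job_step_time = [1.0, 1.0, 1.1, 1.6, 1.7, 1.6]

# Hypothetical lower-layer series sampled at the same timestamps.
layers = {
    "nccl_busbw_gbps": [90, 91, 88, 55, 52, 54],  # drops as step time rises
    "nic_rx_errors": [0, 0, 0, 0, 0, 0],          # flat, so not implicated
}

ranked = rank_suspects(job_step_time, layers)
for name, r in ranked:
    print(f"{name}: r = {r:+.3f}")
```

In this toy data the collective-bandwidth series moves strongly against step time while the NIC error counter stays flat, which would point triage at the NCCL layer rather than the network. Correlation only narrows the search; it does not by itself establish the cause.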