Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus | NVIDIA Technical Blog
… Time series-based Grafana dashboards

Figure 3 shows an example of time series dashboards built from the Prometheus labels, grouped into two categories: NVLink collective dashboards and mixed (network plus NVLink) collective dashboards.

Use cases for NCCL Inspector

To demonstrate the triage workflow, these two use ca… …