Accelerating Long-Context Model Training in JAX and XLA | NVIDIA Technical Blog
…These collectives leverage SHARP, in-network reductions, and multicast acceleration features of NVIDIA NVLink Switch to enable latency-optimized one-shot and throughput-optimized two-shot AllReduce algorithms. The underlying CUDA interface…
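
To make the one-shot versus two-shot distinction concrete, here is a minimal pure-Python simulation. It is a sketch under stated assumptions, not the implementation the post describes: ranks are plain lists, no GPU, NVLink, or in-switch reduction is involved, and the helper names (`one_shot_allreduce`, `two_shot_allreduce`, `elements_received_per_rank`) are hypothetical. The traffic estimate uses a simple peer-to-peer model and ignores the SHARP and multicast offload mentioned above.

```python
# Simulation only: each "rank" is a Python list of numbers.
# All function names here are illustrative, not part of any NVIDIA or JAX API.

def one_shot_allreduce(buffers):
    """One step: every rank reads all peers' full buffers and reduces locally."""
    n = len(buffers[0])
    total = [sum(buf[i] for buf in buffers) for i in range(n)]
    # Every rank ends up with its own copy of the full reduced buffer.
    return [list(total) for _ in buffers]


def two_shot_allreduce(buffers):
    """Two steps: reduce-scatter (one shard per rank), then all-gather."""
    world = len(buffers)
    n = len(buffers[0])
    assert n % world == 0, "sketch assumes the buffer divides evenly across ranks"
    shard = n // world
    # Step 1 (reduce-scatter): rank r reduces elements [r*shard, (r+1)*shard).
    reduced_shards = [
        [sum(buf[i] for buf in buffers) for i in range(r * shard, (r + 1) * shard)]
        for r in range(world)
    ]
    # Step 2 (all-gather): every rank collects all reduced shards in rank order.
    gathered = [x for s in reduced_shards for x in s]
    return [list(gathered) for _ in buffers]


def elements_received_per_rank(n, world):
    """Elements each rank receives under a simple peer-to-peer traffic model."""
    one_shot = (world - 1) * n                 # full buffer from every peer
    two_shot = 2 * (world - 1) * (n // world)  # one shard per peer, in each step
    return one_shot, two_shot


if __name__ == "__main__":
    ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
    print(one_shot_allreduce(ranks))
    print(two_shot_allreduce(ranks))
    print(elements_received_per_rank(n=8, world=4))
```

Both functions return the same reduced result on every rank. In this model the one-shot variant needs a single communication step but moves more data per rank, which matches the "latency-optimized" framing for small messages, while the two-shot variant pays for a second step in exchange for less data per rank, which matches the "throughput-optimized" framing for large messages.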
