Accelerating Long-Context Model Training in JAX and XLA | NVIDIA Technical Blog
…primitives such as reductions and broadcasts are offloaded to the switch. Both of these features allow useful pipelining of compute and communication operations, because most or all of the GPU SMs are available…
