Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core | NVIDIA Technical Blog
…The pipeline bubble also differs and must be distributed evenly across microbatches to achieve end-to-end balance. Equalizing the end-to-end training time across data-parallel (DP) ranks suggests: \(W…
