Making Softmax More Efficient with NVIDIA Blackwell Ultra | NVIDIA Technical Blog
… This reduction in softmax latency tightens the entire pipeline. The gap between BMM1 and BMM2 shrinks considerably, so the Tensor Cores can switch from the query-key multiplication to the probability-value multiplication with minimal stalling. …
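To make the ordering concrete, here is a minimal pure-Python sketch of the three stages named above: BMM1 (query-key multiplication), softmax, and BMM2 (probability-value multiplication). It is an illustration only, not NVIDIA's kernel. The helper names (`matmul`, `softmax_rows`, `attention`), the single unbatched head, and the conventional 1/sqrt(d) scaling are all assumptions that do not come from the excerpt.

```python
import math


def matmul(a, b):
    """Plain matrix product of two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def softmax_rows(scores):
    """Row-wise softmax, subtracting each row's max before exp for stability."""
    out = []
    for row in scores:
        row_max = max(row)
        exps = [math.exp(s - row_max) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Single-head attention on one batch element (hypothetical helper)."""
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed conventional scaling
    # BMM1: query-key multiplication produces the raw scores.
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    # Softmax sits between the two matrix multiplications.
    probs = softmax_rows(scores)
    # BMM2: probability-value multiplication produces the output.
    return probs, matmul(probs, v)
```

Because BMM2 consumes `probs`, it cannot begin on a tile until softmax has finished that tile. The sketch shows only this data dependency; it does not model the overlap or scheduling that hardware performs.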