Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM | NVIDIA Technical Blog
…If the condition is met, the kernel skips the softmax and BMM2 calculation for that block and, crucially, skips loading the \(V\) block from High Bandwidth Memory (HBM). What are the benefits…
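The block-skipping step described above can be sketched in plain Python for a single query row. This is a minimal illustration under stated assumptions, not the TensorRT LLM kernel: the excerpt elides the actual skip condition, so the test used here (skip a block when its largest score lies more than `skip_threshold` below the running maximum) is an assumption, and the names `skip_softmax_attention`, `reference_attention`, and `skip_threshold` are hypothetical. Slicing `V` stands in for the HBM load, and the counter shows how many \(V\) blocks were actually read.

```python
import math


def skip_softmax_attention(q, K, V, block_size, skip_threshold=None):
    """Blockwise attention for one query row with an optional block skip.

    Returns (output_vector, number_of_V_blocks_read).
    `skip_threshold` is a hypothetical knob; None disables skipping.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * len(V[0])   # unnormalized output accumulator
    v_blocks_read = 0

    for start in range(0, len(K), block_size):
        k_block = K[start:start + block_size]
        # BMM1: scores of the query against this key block.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        block_max = max(scores)

        # Assumed skip condition: every weight in this block would be at most
        # exp(-skip_threshold) relative to the current maximum.
        if skip_threshold is not None and running_max - block_max > skip_threshold:
            continue  # no softmax, no BMM2, and the V block is never read

        v_block = V[start:start + block_size]  # stands in for the HBM load
        v_blocks_read += 1

        # Standard online-softmax update (rescale old state to the new max).
        new_max = max(running_max, block_max)
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        # BMM2: weighted sum of the value rows in this block.
        acc = [
            a * correction + sum(w * v[j] for w, v in zip(weights, v_block))
            for j, a in enumerate(acc)
        ]
        running_max = new_max

    return [a / running_sum for a in acc], v_blocks_read


def reference_attention(q, K, V):
    """Plain softmax attention over all keys, used only for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


# Toy inputs: the second key block scores far below the first.
q = [1.0, 0.0]
K = [[4.0, 0.0], [3.0, 0.0], [-40.0, 0.0], [-41.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [7.0, 7.0]]

exact_out, exact_reads = skip_softmax_attention(q, K, V, block_size=2)
skip_out, skip_reads = skip_softmax_attention(q, K, V, block_size=2,
                                              skip_threshold=10.0)
```

In this toy case the second block is skipped and only one \(V\) block is read, while the output stays very close to the exact result because the skipped weights are negligible. The skip is an approximation: how much accuracy it costs depends on the threshold, which is why the sketch exposes it as a parameter rather than fixing a value.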