Accelerating Long-Context Inference with Skip Softmax in NVIDIA TensorRT LLM | NVIDIA Technical Blog
…It then applies softmax to normalize these scores into probabilities (\(P\)) and multiplies them by values (\(V\)). However, attention is intrinsically sparse. For many blocks, the attention scores are so low compared…
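To make the excerpt concrete, here is a minimal pure-Python sketch of attention for a single query: a dense reference, and a blockwise version that can skip blocks whose scores are very low. The excerpt is truncated before it says what the low scores are compared against, so the skip rule here (skip a block when its largest score falls more than `skip_threshold` below the running maximum) is an assumption made for illustration. It is not the TensorRT LLM implementation. The names `dense_attention`, `blockwise_attention`, and `skip_threshold` are hypothetical.

```python
import math


def _scaled_scores(q, keys):
    """Scaled dot-product scores q.k / sqrt(d) for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def dense_attention(q, keys, values):
    """Reference: softmax over all scores, then a weighted sum of values."""
    scores = _scaled_scores(q, keys)
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size, skip_threshold=None):
    """Online-softmax attention over key/value blocks.

    ASSUMPTION (not from the source): a block is skipped when its largest
    score is more than `skip_threshold` below the running maximum, so its
    softmax weights would be at most exp(-skip_threshold) relative to the
    largest weight seen so far.
    Returns (output_vector, number_of_skipped_blocks).
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), block_size):
        scores = _scaled_scores(q, keys[start:start + block_size])
        v_blk = values[start:start + block_size]
        blk_max = max(scores)
        if skip_threshold is not None and running_max - blk_max > skip_threshold:
            skipped += 1  # negligible contribution: skip exp and the P*V product
            continue
        new_max = max(running_max, blk_max)
        rescale = math.exp(running_max - new_max)  # 0.0 for the first block
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc], skipped


# Toy data: the first block aligns with the query, the second opposes it.
q = [8.0, 0.0]
keys = [[1.0, 0.0], [0.9, 0.0], [-1.0, 0.0], [-1.2, 0.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

exact = dense_attention(q, keys, values)
approx, skipped = blockwise_attention(q, keys, values, block_size=2,
                                      skip_threshold=10.0)
```

With `skip_threshold=None` the blockwise version matches the dense reference up to floating-point rounding. With a threshold set, the output is approximate, and the threshold trades accuracy for skipped work.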
