Extract More Kernel Performance with NVIDIA CompileIQ Auto-Tuning | NVIDIA Technical Blog
… Scaled dot-product attention, including its fused and flash attention variants, accounts for another 25%. Together, these two kernel families represent more than 90% of end-to-end inference compute. …