Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode
… This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. …
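The conventional estimate described above can be written down in a few lines. The following is a minimal sketch of that naive bandwidth model only, not the paper's method: the function name and every number in it (parameter count, bytes per parameter, KV cache size, peak bandwidth) are hypothetical values chosen for illustration.

```python
def bandwidth_bound_step_latency(
    weight_bytes: float,
    kv_cache_bytes: float,
    peak_bandwidth_bytes_per_s: float,
) -> float:
    """Return the idealized latency (seconds) of one batch-1 decode step.

    Assumes the step is limited purely by streaming the weights and the
    active KV cache once from HBM at peak bandwidth. Real latency can be
    higher; this is only the conventional lower-bound estimate.
    """
    if peak_bandwidth_bytes_per_s <= 0:
        raise ValueError("peak bandwidth must be positive")
    return (weight_bytes + kv_cache_bytes) / peak_bandwidth_bytes_per_s


# Hypothetical configuration, for illustration only.
weight_bytes = 7e9 * 2      # assumed: 7B parameters stored at 2 bytes each
kv_cache_bytes = 1e9        # assumed: 1 GB of active KV cache
peak_bandwidth = 1e12       # assumed: 1 TB/s peak HBM bandwidth

latency_s = bandwidth_bound_step_latency(
    weight_bytes, kv_cache_bytes, peak_bandwidth
)
print(f"{latency_s * 1e3:.1f} ms per decode step under the naive model")
```

Under this model, doubling peak bandwidth halves the predicted per-token latency, which is the scaling expectation the passage refers to.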