Scaling the Memory Wall: The Rise and Roadmap of HBM
… Let's go through how HBM is used and where the pressures are.

HBM Usage in Inference

In LLM inference, all the model weights stay resident in on-package HBM for as long as the model is being served, so the GPU can read them at full HBM bandwidth instead of waiting on slower off-package memory. Alongside the weights, HBM also holds the KV cache. …
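
To make the two residents of HBM concrete, here is a minimal back-of-the-envelope sketch of how much capacity the weights and the KV cache each take. The model shape used below (7 billion parameters, 32 layers, 32 KV heads, head dimension 128, FP16 storage) and the serving load (batch of 8 sequences at 4096 tokens each) are illustrative assumptions, not figures from this article, and the function names are hypothetical. The KV cache formula assumes a standard transformer that stores one key and one value vector per token, per layer, per KV head.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def weight_bytes(num_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed to keep all model weights resident (default: FP16)."""
    return num_params * bytes_per_param


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,
) -> int:
    """Bytes needed for the KV cache.

    The leading factor of 2 accounts for storing both a key and a value
    vector for every token, in every layer, for every KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Assumed, illustrative model shape and serving load (not from the article).
    weights = weight_bytes(num_params=7_000_000_000)
    kv = kv_cache_bytes(
        num_layers=32,
        num_kv_heads=32,
        head_dim=128,
        seq_len=4096,
        batch_size=8,
    )
    print(f"weights:  {weights / GIB:.1f} GiB")
    print(f"KV cache: {kv / GIB:.1f} GiB")
    print(f"total:    {(weights + kv) / GIB:.1f} GiB")
```

Under these assumed numbers the KV cache (16 GiB) is already larger than the FP16 weights (about 13 GiB). The weight footprint is fixed once the model is loaded, while the KV cache grows linearly with both batch size and sequence length, so it is the part of HBM usage that scales with load.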
