The latest Gemma 4 models use a training trick to slash their on-device memory footprint
… The Gemma 4 models optimized with quantization-aware training (QAT) are available in five sizes: Gemma 4 E2B, Gemma 4 E4B, Gemma 4 12B, Gemma 4 26B A4B, and Gemma 4 31B. …
Gemma is a family of open-weight language models released by Google for text generation and related NLP tasks.
The process uses a technique called “speculative decoding,” in which a small drafter model predicts several upcoming tokens of the response before the main Gemma model has generated them. The main model then verifies the whole drafted sequence in a single parallel pass instead of producing the tokens one at a time, keeping the tokens it agrees with and correcting the first one it does not.
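The draft-then-verify loop can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not Google's MTP implementation: `toy_target` and `toy_drafter` are hypothetical stand-ins for the main model and the drafter, decoding is greedy, and the verification that a real model performs in one batched forward pass is simulated here with an ordinary loop.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def greedy_decode(target: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: the target model generates one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft up to k tokens, then verify them."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1) The cheap drafter proposes a short run of tokens.
        draft: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            draft.append(drafter(out + draft))

        # 2) The target checks every drafted position. A real model scores
        #    all positions in one forward pass; this loop only simulates that.
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft token confirmed
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every draft token matched, so the same pass yields a bonus token.
            if len(out) + len(accepted) < limit:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out


# Hypothetical toy models: the drafter agrees with the target most of the time.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 5


def toy_drafter(ctx: List[Token]) -> Token:
    return 0 if len(ctx) % 3 == 0 else (ctx[-1] + 1) % 5
```

Because every emitted token is either confirmed or supplied by the target, the greedy variant returns exactly what the target would have produced alone. The speedup comes only from the target confirming several tokens per pass when the drafter guesses well.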
Google's latest trick gets Gemma 4 running 3x faster right on your phone
… The new model performs close to the Gemma 4 26B MoE model in benchmarks. Back in April, Google released its mobile-friendly Gemma E2B and E4B models, bringing on-device multimodal AI to Android and iOS devices. …
… So, Google is now offering a potential solution, which it claims can speed up Gemma 4 models by up to three times. Google recently released Multi-Token Prediction (MTP) drafters for Gemma 4. …
… The headline feature of the public release is support for Gemma 4. This isn’t just a minor iteration; Gemma 4 is built on the same architecture as Gemini 3, but offers improvements in logic, multilingual support, and a 256K context window. …
… Gemma 4 promises to be Google's fastest and smartest on-device AI tool. That all sounds pretty impressive, and to help developers get a head start on integrating these models with their Android apps, Google has released early access to Gemini Nano 4 via an AICore Developer Preview. …