Gemma 12b less than 10 watts 6.5pp 1.3tg
Summary
Running the Gemma 12B model on a Google Pixel 10 Pro with llama.cpp achieves 6.5 tokens per second for prompt processing (pp) and 1.3 tokens per second for token generation (tg) while drawing under 10 watts, demonstrating efficient on-device AI inference.
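To put the figures in perspective, the reported throughput and power draw can be turned into a rough energy cost per token. This is a minimal sketch, not a measurement: it assumes the "under 10 watts" figure applies as an upper bound to both prompt processing and generation, and the function and constant names are hypothetical.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per token in joules. A watt is one joule per second,
    so dividing power by throughput gives joules per token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Figures from the summary. The power value is an assumed upper bound
# ("under 10 watts"), applied to both phases for illustration.
POWER_UPPER_BOUND_W = 10.0
PROMPT_TPS = 6.5      # prompt processing, tokens per second
GENERATION_TPS = 1.3  # token generation, tokens per second

prompt_j = joules_per_token(POWER_UPPER_BOUND_W, PROMPT_TPS)
gen_j = joules_per_token(POWER_UPPER_BOUND_W, GENERATION_TPS)

print(f"prompt processing: under {prompt_j:.2f} J/token")  # under 1.54 J/token
print(f"generation:        under {gen_j:.2f} J/token")     # under 7.69 J/token
```

Under these assumptions, a 100-token reply would cost less than about 770 joules, or roughly 0.21 watt-hours.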
Similar Articles
@analogalok: i just ran Google's brand new Unsloth Gemma4 12B dense GGUF on my RTX 4060 using llama.cpp + CUDA 13.2 21 tokens per se…
Google's new Gemma 4 12B is a single decoder-only transformer with encoder-free multimodal input. It achieves strong benchmark results while being small enough to run locally on a budget GPU, and it is released under the Apache 2.0 license.
Introducing Gemma 3 270M: The compact model for hyper-efficient AI
Google introduces Gemma 3 270M, a compact 270-million parameter model designed for efficient on-device AI with strong instruction-following capabilities and extreme energy efficiency (0.75% battery for 25 conversations on Pixel 9 Pro).
You don't need a GPU to run gemma-4-26B-A4B
The author demonstrates that the Gemma-4-26B-A4B model runs efficiently on a CPU-only system using KoboldCpp, achieving 7 tokens per second on an old desktop, which suggests that a powerful GPU may not be necessary for local LLM inference.
Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5
Gemma 4 is demonstrated running in-browser via WebGPU at 255 tokens per second, using kernels generated by Fable 5, showcasing efficient on-device inference.
120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP
Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide, and its benchmark comparison against running without MTP shows a 2x speedup.