Gemma 12b less than 10 watts 6.5pp 1.3tg

Reddit r/LocalLLaMA News

Summary

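Running the Gemma 4 12B model on a Google Pixel 10 Pro with a Vulkan build of llama.cpp in Termux achieves 6.5 tokens per second for prompt processing and 1.3 tokens per second for generation at a prompt depth of about 10,000 tokens, while drawing under 10 watts. The result shows that a 12B model can run on-device within a phone's power budget, although slowly.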

Device: Google Pixel 10 Pro, Termux

llama.cpp version: 9639 (ef8268fee)

    $ ./llama.cpp/build_vulkan/bin/llama-cli \
        -m storage/downloads/gemma-4-12b-it-UD-Q3_K_XL.gguf \
        --model-draft storage/downloads/mtp-gemma-4-12b-it.gguf \
        --temp 1.0 --top-p 0.95 --top-k 64 \
        --spec-type draft-mtp --spec-draft-n-max 1 \
        -c 32000 --mlock -b 512 -ctk q8_0 -ctv q8_0

Result at ~10,000 prompt depth:

    [ Prompt: 6.5 t/s | Generation: 1.3 t/s ]
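
To put the reported rates in perspective, here is a minimal Python sketch that parses the summary line and derives two rough figures: how long a 10,000-token prompt would take to process, and an upper bound on energy per generated token. It is not part of the original post. The helper names (`parse_rates`, `derived_figures`) are hypothetical. The calculation assumes the 6.5 t/s rate holds across the whole prompt and that the "less than 10 watts" from the title is a ceiling on average power draw.

```python
import re


def parse_rates(line: str) -> dict[str, float]:
    """Extract 'Label: N t/s' pairs from a llama-cli style summary line."""
    pairs = re.findall(r"(\w+):\s*([\d.]+)\s*t/s", line)
    return {label.lower(): float(value) for label, value in pairs}


def derived_figures(rates: dict[str, float], prompt_tokens: int, max_watts: float) -> dict[str, float]:
    """Rough estimates only; assumes constant rates and max_watts as a power ceiling."""
    return {
        # Time to ingest the prompt at the reported prompt-processing rate.
        "prompt_minutes": prompt_tokens / rates["prompt"] / 60,
        # Wall-clock time per generated token.
        "seconds_per_token": 1 / rates["generation"],
        # Energy ceiling per generated token: watts / (tokens per second) = joules per token.
        "max_joules_per_token": max_watts / rates["generation"],
    }


summary = "[ Prompt: 6.5 t/s | Generation: 1.3 t/s ]"
rates = parse_rates(summary)
figures = derived_figures(rates, prompt_tokens=10_000, max_watts=10.0)

for name, value in figures.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions, ingesting the prompt takes roughly 26 minutes and each generated token costs at most about 7.7 joules. Because the power figure is only an upper bound, the real energy per token is likely lower.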

Similar Articles

You don't need a GPU to run gemma-4-26B-A4B

Reddit r/LocalLLaMA

The author demonstrates that the Gemma-4-26B-A4B model runs on a CPU-only system using KoboldCpp, achieving 7 tokens per second on an old desktop, which suggests that a powerful GPU may not be necessary for local LLM inference.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Reddit r/LocalLLaMA

Google's Gemma 4 12B QAT model achieves 120 tok/s on a 12 GB GPU using Multi-Token Prediction (MTP) with llama.cpp. The post includes a step-by-step guide, and its benchmark comparison against running without MTP shows a 2x speedup.