Tag
Gemma 4 12B QAT (dense) achieves over 1000 tokens per second prefill on an 8GB RTX 4060 with 120k context using TurboQuant, enabling full GPU layer offloading. This represents a 42% increase in prefill speed over previous methods.
User runs Gemma 4 31B dense model on 8GB VRAM gaming laptop at ~3 tokens/sec using llama.cpp with MTP speculative decoding, demonstrating feasibility of running a 31B dense model on consumer hardware and proposing agentic workflows where a fast MoE model routes to this slower dense model for hard tasks.
Introduces running a fully autonomous AI Agent on an entry-level laptop with 8GB VRAM using the Gemma 4 26B model and Hermes Desktop tool, enabling local file operations, code modification, web browsing, etc., significantly lowering the barrier for local Agents.
Built a repository to train a tiny language model (25M parameters) from scratch on 8GB VRAM, with support for MTP but noting limitations of mHC and BitNet.
A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.