@leopardracer: GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW 20 tokens per second and a context window so large you can …

X AI KOLs Timeline 06/10/26, 12:55 PM Models

local-ai gemma-4 quantization llama-cpp consumer-gpu 248k-context

Summary

The post reports Gemma 4 26B running on an 8 GB RTX 4060 with a 248K-token context at 20 tokens per second, using llama.cpp and Q4_K_XL quantization. That is enough context to process entire codebases locally on consumer hardware.
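
The post does not include the actual launch command. As a rough sketch only, the Python below assembles a llama.cpp `llama-server` invocation from the details mentioned on this page. The model file name, the exact context value (248K taken as 248,000), and the `-ngl 99` setting are assumptions for illustration; `--cpu-moe` is the long form of the `-cmoe` flag mentioned in a related post further down, and it is not confirmed that this poster used it.

```python
import shlex

# Hypothetical values: the post gives neither the file name nor the exact settings.
MODEL_PATH = "gemma-4-26b-Q4_K_XL.gguf"  # hypothetical GGUF file name
CONTEXT_TOKENS = 248_000                 # "248K" from the post, assumed to mean 248,000


def build_llama_server_command(model_path, context_tokens, experts_on_cpu=True):
    """Return the argv list for a llama.cpp `llama-server` launch."""
    argv = [
        "llama-server",
        "-m", model_path,           # path to the quantized model file
        "-c", str(context_tokens),  # context size in tokens
        "-ngl", "99",               # offload as many layers as possible to the GPU
    ]
    if experts_on_cpu:
        # Keep the MoE expert weights in system RAM instead of VRAM.
        argv.append("--cpu-moe")
    return argv


command = build_llama_server_command(MODEL_PATH, CONTEXT_TOKENS)
print(shlex.join(command))
```

Building the argument list in code rather than as one string keeps each setting easy to vary, for example when testing how far the context size can be pushed before VRAM runs out.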

Original Article

Cached at: 06/10/26, 07:56 PM

GEMMA 4 26B ON AN RTX 4060 WITH A 248K TOKEN CONTEXT WINDOW

20 tokens per second and a context window so large you can feed it entire codebases, books and research papers in a single prompt

this is not a cloud api and not a server rack, this is a regular consumer gpu running locally with llama.cpp and q4_k_xl quantization

248k context on an 8gb vram card was not supposed to be possible and here it is just running on someone’s desk

the article below covers exactly which tools and configs make this kind of setup work in 2026 ↓
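
To illustrate why a 248K context on an 8 GB card is surprising, here is a back-of-the-envelope estimate of the key/value cache for a plain full-attention transformer. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's real configuration, and the 8-bit figure ignores per-block scale overhead. The post does not say which techniques close the gap.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value):
    """Full-attention KV cache size: one K and one V vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical architecture numbers, for illustration only (NOT Gemma 4's real config).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
CONTEXT = 248_000  # "248K" from the post, assumed to mean 248,000 tokens

GIB = 1024 ** 3
fp16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 2)  # 2 bytes per value
q8 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 1)    # roughly 1 byte per value

print(f"fp16 cache:  {fp16 / GIB:.1f} GiB")
print(f"8-bit cache: {q8 / GIB:.1f} GiB")
```

Under these assumed numbers the cache alone is several times larger than 8 GB at 16-bit precision, so a setup like the one described would have to shrink the cache, keep part of the model or cache outside VRAM, or both.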

Similar Articles

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…

X AI KOLs Timeline

Alok demonstrates running Gemma 4 26B MoE on 8GB VRAM using Unsloth's QAT quant and the -cmoe flag in llama.cpp, achieving 20 tokens/sec with 250k context, marking a major milestone for budget local AI.

@leopardracer: SAME GPU SAME MODEL SAME CONTEXT AND 2X THE SPEED rtx 4060, gemma 4 12b, 48k context just switched the quantization fro…

X AI KOLs Timeline

Changing quantization from q4_k_m to q4_k_xl in llama.cpp doubles inference speed on the same GPU without hardware or driver changes, as demonstrated with Gemma 4 12B on an RTX 4060.

Qwen3.6-35B vs Gemma4-26B on 7900 XTX

Reddit r/LocalLLaMA

A detailed benchmark comparing Qwen3.6-35B and Gemma4-26B on Radeon 7900 XTX shows Gemma is ~20% faster end-to-end despite slower token generation, because Qwen generates ~2x more tokens due to internal reasoning. The article recommends using Qwen for throughput-bound batch work and Gemma for latency-sensitive single requests.

Ran gemma 4 12b on my 3090 yesterday and I think the local model game just changed

Reddit r/artificial

A user reports running Google's Gemma 4 12B model locally on a single RTX 3090 via GGUF quantization, finding strong performance including real 256k context, multimodal capabilities, and function calling that outperforms larger 70B models for coding tasks.
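
The Qwen3.6-35B vs Gemma4-26B item above says Gemma finishes about 20% faster end to end despite slower token generation, because Qwen emits roughly twice as many tokens. A minimal sketch of that arithmetic follows. The token counts and speeds are hypothetical numbers chosen to reproduce the shape of the claim, not figures from the benchmark, and prompt-processing time is ignored.

```python
def response_seconds(output_tokens, tokens_per_second):
    """Time spent generating a response, ignoring prompt processing."""
    return output_tokens / tokens_per_second


# Hypothetical figures: Qwen generates faster but emits twice as many tokens.
gemma_seconds = response_seconds(output_tokens=500, tokens_per_second=50.0)
qwen_seconds = response_seconds(output_tokens=1000, tokens_per_second=80.0)

time_saved = 1 - gemma_seconds / qwen_seconds

print(f"Gemma: {gemma_seconds:.1f} s, Qwen: {qwen_seconds:.1f} s")
print(f"Gemma finishes in {time_saved:.0%} less time")
```

For a single latency-sensitive request, what matters is total tokens divided by speed, so a model with a lower tokens-per-second figure can still return its answer sooner.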