You don't need a GPU to run gemma-4-26B-A4B

Reddit r/LocalLLaMA Models

Summary

The author reports that the Gemma-4-26B-A4B model runs well on a CPU-only system using KoboldCpp, reaching roughly 7 tokens per second on an old i5-8500 desktop with 32GB of RAM, and suggests that a powerful GPU may not be necessary for local LLM inference.

I've been running LLMs on my old potato i5-8500 with 32GB of RAM and *no GPU* for a while now, going up to 12B dense models, which run slow but are perfectly usable. But this Gemma-4-26B-A4B simply flies on this CPU-only machine using KoboldCpp on Linux. That's right, an old used $150 desktop computer is running state-of-the-art LLMs at something like 7 T/s. Yeah, go ahead and scoff. You can brag about your super-rig that costs more than a used car, but I'm bragging about a crappy old desktop I bought off eBay that costs less than a night out and runs the same thing. I keep thinking about buying a GPU, but it's beginning to look like it might not be necessary. These smaller models are amazing without a GPU.
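
For anyone wondering why a 26B model can outrun a 12B dense one on the same CPU, here is a rough back-of-envelope sketch. It assumes token generation is limited by memory bandwidth, so each token costs one read of the weights that are actually used. The numbers are assumptions, not measurements from the post: "A4B" is read as about 4B active parameters per token, the quantization is taken to be about 4.5 bits per weight, and the bandwidth is the theoretical dual-channel DDR4-2666 peak of 41.6 GB/s. The function name `decode_ceiling_tps` is made up for illustration.

```python
def decode_ceiling_tps(active_params_b, bits_per_weight, bandwidth_gbs):
    """Upper bound on tokens/s if every active weight is read from RAM once per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


BANDWIDTH_GBS = 41.6   # assumption: theoretical dual-channel DDR4-2666 peak
BITS_PER_WEIGHT = 4.5  # assumption: a typical 4-bit quantization

# Mixture-of-experts: only the active parameters are read for each token.
moe = decode_ceiling_tps(4, BITS_PER_WEIGHT, BANDWIDTH_GBS)
# Dense: all 12B parameters are read for each token.
dense = decode_ceiling_tps(12, BITS_PER_WEIGHT, BANDWIDTH_GBS)

print(f"26B-A4B (about 4B active): {moe:.1f} tokens/s ceiling")
print(f"12B dense: {dense:.1f} tokens/s ceiling")
```

Under these assumptions the ceiling is about 18 tokens/s for the MoE model and about 6 tokens/s for the 12B dense one, a 3x gap that comes entirely from the active parameter count. Real throughput lands below the ceiling, which is consistent with the roughly 7 T/s reported above.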
