24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA 05/13/26, 08:41 PM Tools

moe-inference llama-cpp quantization pcie-offloading speculative-decoding memory-optimization 8gb-vram

Summary

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.

I got **Qwen 3.6 35B-A3B** and **Gemma 4 26B-A4B** running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp (the TurboQuant/RotorQuant KV cache quantisation allows 128k context within the 8 GB VRAM). **Results (Q4\_K\_M models, 128k context):** |Model|tok/s|Key flags| |:-|:-|:-| |Qwen 3.6 35B-A3B|\~24| \--n-cpu-moe 30, K=turbo4 V=turbo3| |Gemma 4 26B-A4B (no MTP) |\~20|\--n-cpu-moe 20, K=V=turbo3, --flash-attn| |Gemma 4 26B-A4B + MTP (naive)|\~21|embedding table silently on CPU| |Gemma 4 26B-A4B + MTP (fixed)|\~24.5|\--override-tensor-draft "token\_embd\\.weight=CUDA0"| The trick is MoE offloading: llama.cpp can park the cold expert weights in system RAM, and stream over PCIe to the GPU, while keeping hot layers + KV cache on GPU. The system is fully PCIe bandwidth-limited (GPU sits at \~40-50% utilisation while PCIe 3.0 x16 is maxed out). **Biggest finding:** Gemma 4's MTP speculative decoding barely helps out of the box (\~5% gain). Turns out llama.cpp unconditionally keeps the token embedding table on CPU. Normally that's fine (just a `get_rows` lookup), but Gemma 4's MTP assistant has a tied LM head - so every draft token does a full 262k×1024 matmul across PCIe. Forcing it onto GPU with `--override-tensor-draft` gives the real \~22% speedup and \~79% draft acceptance rate. **Setup pain points (Fedora 42 + Pascal GPU):** * Pin akmod-nvidia to 580xx branch (Pascal is going legacy) * Force gcc-14 for CUDA 12.9 (newer gcc rejected) * Patch CUDA's math\_functions.h for glibc 2.41 compatibility * Used the AtomicBot-ai/atomic-llama-cpp-turboquant fork for both TurboQuant cache + Gemma MTP support [Full blog post with all the grindy build details](https://mdda.net/blog/tech/dl/llama-cpp-moe-on-an-old-gtx-1080) (every command, and the debugging deep-dive into the MTP embedding table issue) I'm also planning a YouTube video walkthrough soon - I'll update when that's live. Happy to answer questions about the setup.

Original Article

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Similar Articles

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…

@analogalok: my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 …

@outsource_: NEW GLM+ QWEN 18B RUNS ON CONSUMER GPU IT BEATS 35B MoE AT HALF THE VRAM @KyleHessling1 just dropped the healed Qwopus-…

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown

Submit Feedback

Similar Articles

@analogalok: Run Gemma 4 26B MoE on 8GB VRAM with 250k context at 20+ tokens/sec If you own any 8GB VRAM graphics card, stop what yo…

@analogalok: my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 …

@outsource_: NEW GLM+ QWEN 18B RUNS ON CONSUMER GPU IT BEATS 35B MoE AT HALF THE VRAM @KyleHessling1 just dropped the healed Qwopus-…
A new 18B merged quantized model, Qwopus-GLM-18B-GGUF, outperforms 35B MoE models while using half the VRAM and running on consumer GPUs.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Qwen 3.5 122B MoE OC on a single 3090 at 35 t/s — full local stack breakdown