@jun_song: If we ever figure out how to load ONLY the active params of an MoE into the GPU instead of the full weights, it's game …

X AI KOLs Following 05/10/26, 08:07 AM News

Summary

The author speculates that loading only the active parameters of an MoE model onto the GPU, rather than the full weights, could give data centers a 100x efficiency boost and let 1T-parameter models like Kimi run locally on 32GB of VRAM, while acknowledging that this is basically impossible today.

If we ever figure out how to load ONLY the active params of an MoE into the GPU instead of the full weights, it's game over. Data centers would see a 100x efficiency boost. And we could literally run 1T models like Kimi locally on just 32GB VRAM. Yeah I know it's basically impossible right now, but who knows what the future holds. Let me dream.
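
As a rough back-of-the-envelope check on the 32GB figure, the sketch below compares weight storage for a full MoE model against its per-token active subset. The numbers are assumptions, not taken from the post: 1T total parameters, 32B active per token (the figure commonly cited for Kimi K2), and 4-bit quantized weights. KV cache and activations are ignored.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


# Assumed figures for illustration only (not stated in the post).
TOTAL_PARAMS_B = 1000.0   # 1T total parameters
ACTIVE_PARAMS_B = 32.0    # parameters used per token (assumed)
BITS_PER_PARAM = 4.0      # 4-bit quantization (assumed)
VRAM_GB = 32.0            # VRAM budget named in the post

full_gb = weights_gb(TOTAL_PARAMS_B, BITS_PER_PARAM)      # 500.0
active_gb = weights_gb(ACTIVE_PARAMS_B, BITS_PER_PARAM)   # 16.0
reduction = full_gb / active_gb                           # 31.25
fits_in_vram = active_gb <= VRAM_GB                       # True

print(f"full weights:   {full_gb:.0f} GB")
print(f"active weights: {active_gb:.0f} GB")
print(f"reduction:      {reduction:.2f}x")
print(f"fits in {VRAM_GB:.0f} GB VRAM: {fits_in_vram}")
```

Under these assumptions the active weights take about 16 GB against 500 GB for the full model, roughly a 31x reduction. The catch is that the router picks a different set of experts for each token at each layer, so the active subset is not a fixed 16 GB slice that can be loaded once and left in place.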

Similar Articles

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.

What is the point of MoE models, beyond being faster?

Reddit r/LocalLLaMA

A discussion about the advantages of Mixture of Experts (MoE) models over dense models beyond speed, considering RAM constraints and scaling limits.

Are the rich RAM /poor GPU people wrong here?

Reddit r/LocalLLaMA

Discusses the trade-off between dense and Mixture-of-Experts (MoE) models for local AI, noting that high-RAM users have limited MoE options beyond Qwen 3.5 122B, and questioning whether a large GPU is the only viable path.

@witcheer: can’t believe gpt-oss-20b perfs on 8GB vRAM 21B total params, 3.6B active (MoE). OpenAI, Apache 2.0. uses only 1.8 GB V…

X AI KOLs Timeline

A new open-source MoE model, gpt-oss-20b (21B total, 3.6B active), runs on only 1.8GB VRAM and achieves perfect scores on agentic coding tasks, outperforming other local models like Gemma and Qwen.

@analogalok: my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 …

X AI KOLs Timeline

A user runs the Gemma 4 31B dense model on a gaming laptop with 8GB of VRAM at ~3 tokens/sec using llama.cpp with MTP speculative decoding, showing that a 31B dense model is feasible on consumer hardware. The user also proposes agentic workflows in which a fast MoE model routes hard tasks to this slower dense model.
