Multi-Tier MoE Caching

Reddit r/LocalLLaMA News

Summary

Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.

I've never seen much discussion around this, but it feels like where MoE inference is heading. Most of the big models we use (GLM 5.2, DeepSeek V4, StepFun, MiniMax) are MoE, meaning inference only runs on a small subset of the experts for each token.

Currently we scatter these experts across a mix of CPU RAM and GPU VRAM, giving us an aggregate speed of the two pipelines combined. A fairly typical system may look like:

128 GB of DDR5-6000 at ~48 GB/s
24 GB of GDDR6X at ~936 GB/s

Assuming all memory is used, that gives a combined (capacity-weighted average) bandwidth of about 188 GB/s.

I added some debugging to see the typical activation pattern in something like Qwen3.6 35B while processing a large C# codebase, with multiple prompts on top to fill up my context. I get this:

Top 1% of experts account for 20% of activations.
Top 5% of experts account for 50% of activations.
Top 10% of experts account for 70% of activations.
Top 15% of experts account for 80% of activations.
Top 20% of experts account for 85% of activations.

Meaning if I could shift just 20% of my experts (or layers/tensors) to the GPU, I should get 85% of activations running at full speed. Caches could adapt to the session over time, perhaps even maintaining separate hot sets for coding, creative writing, etc.

This isn't a new idea. There are quite a few papers on hierarchical caching and expert prefetching, and some practical implementations already exist:

PowerInfer (how the Tiiny.ai box claims to be able to run 122B models): https://github.com/Tiiny-AI/PowerInfer
Lidenburg's llama.cpp branch: https://github.com/Lidenburg/llama.cpp
HOBBIT, FlashMoE, Fiddler, DuoServe-MoE, M2Cache, etc.

I'm curious what others think, and whether anyone knows of other work happening in this area. The idea is obviously focused mainly on hybrid RAM/VRAM setups, but it also touches on things like the recent developments that allow running models from NVMe on Mac.
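The coverage figures in the post (for example, the top 20% of experts handling 85% of activations) can be turned into a rough bandwidth estimate. The sketch below is not taken from any of the projects mentioned. The function names `coverage` and `effective_bandwidth` and the toy activation counts are hypothetical, chosen only for illustration. It assumes every expert is the same size, that GPU and CPU reads happen one after the other, and that attention layers, dense layers, and PCIe transfers can be ignored. Under those assumptions the result is a time-weighted estimate, which differs from a capacity-weighted average such as the ~188 GB/s figure.

```python
def coverage(activation_counts, top_fraction):
    """Share of all activations handled by the hottest `top_fraction` of experts.

    activation_counts: one activation count per expert (any order).
    top_fraction: fraction of experts kept in fast memory, e.g. 0.2 for 20%.
    """
    counts = sorted(activation_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no activations recorded")
    # Always keep at least one expert in the hot set.
    k = max(1, round(top_fraction * len(counts)))
    return sum(counts[:k]) / total


def effective_bandwidth(hit_rate, fast_gbps, slow_gbps):
    """Time-weighted bandwidth (GB/s) when `hit_rate` of expert reads hit fast memory.

    Assumption: equal-sized experts and serialized reads, so the time to read
    one GB is the hit-weighted sum of the per-tier read times.
    """
    seconds_per_gb = hit_rate / fast_gbps + (1.0 - hit_rate) / slow_gbps
    return 1.0 / seconds_per_gb


if __name__ == "__main__":
    # Hypothetical activation counts for 10 experts (not measured data).
    toy_counts = [40, 20, 10, 10, 5, 5, 4, 3, 2, 1]
    print("coverage of top 20%:", coverage(toy_counts, 0.2))

    # Figures from the post: 85% hit rate, 936 GB/s VRAM, 48 GB/s system RAM.
    print("estimated GB/s:", round(effective_bandwidth(0.85, 936, 48)))
```

With the post's numbers this gives roughly 248 GB/s. Even at an 85% hit rate, the 15% of reads that fall through to system RAM account for about three quarters of the total read time in this model, so most of the remaining gain would come from raising the hit rate.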

Similar Articles

Mixture of Experts (MoEs) in Transformers

Hugging Face Blog

Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.

MobileMoE: Scaling On-Device Mixture of Experts

Hugging Face Daily Papers

MobileMoE introduces efficient on-device mixture-of-experts language models with sub-billion parameters, achieving better performance and efficiency than dense baselines and existing MoE models. The models are trained on open-source datasets and demonstrate significant speedups on commodity smartphones.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models such as Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8 GB VRAM) with 128k context, using llama.cpp with MoE offloading and TurboQuant KV cache quantization. The post also covers optimization tricks for Gemma's MTP speculative decoding.