Multi Tier MoE Caching

Reddit r/LocalLLaMA 06/23/26, 07:21 AM News

moe caching inference mixture-of-experts gpu cpu hybrid-memory expert-caching llm-inference

Summary

Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.

I haven't seen much discussion of this, but it feels like where MoE inference is heading. Most of the big models we use (GLM 5.2, DeepSeek V4, StepFun, MiniMax) are MoE, meaning each token only runs through a small subset of the experts. Currently we scatter these experts across a mix of CPU RAM and GPU VRAM, which gives us an aggregate speed from the two pipelines combined.

A fairly typical system might look like this:

128 GB of DDR5-6000 at ~48 GB/s
24 GB of GDDR6X at ~936 GB/s

Assuming all memory is used, that works out to a combined (capacity-weighted average) bandwidth of about 188 GB/s.

I added some debugging to look at the standard expert activation in something like Qwen3.6 35B while processing a large C# codebase, with multiple prompts on top to fill up my context. This is what I got:

Top 1% of experts account for 20% of activations.
Top 5% of experts account for 50% of activations.
Top 10% of experts account for 70% of activations.
Top 15% of experts account for 80% of activations.
Top 20% of experts account for 85% of activations.

Meaning that if I could shift just the top 20% of my experts (or layers/tensors) to the GPU, I should get 85% of activations running at full speed. Caches could adapt to the session over time, perhaps even maintaining separate hot sets for coding, creative writing, etc.

This isn't a new idea. There are quite a few papers on hierarchical caching and expert prefetching, and some practical implementations already exist:

PowerInfer (how the Tiiny.ai box claims to be able to run 122B models): https://github.com/Tiiny-AI/PowerInfer
Lidenburg's llama.cpp branch: https://github.com/Lidenburg/llama.cpp
HOBBIT, FlashMoE, Fiddler, DuoServe-MoE, M2Cache, etc.

I'm curious what others think and whether anyone knows of work happening in this area. It's obviously focused mainly on hybrid RAM/VRAM setups, but it also touches on things like the recent developments that allow running models from NVMe on Mac.
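For anyone who wants to play with the numbers in the post, here is a rough back-of-envelope sketch in Python. It is not taken from any of the projects listed; the function names are made up for illustration. It assumes every expert is the same size, that expert reads happen one after another, and that an expert kept in system RAM is read at the DDR5 figure while a cached one is read at the GDDR6X figure. It ignores PCIe transfers, compute time, attention and shared layers, so treat the output as an upper-level estimate only. The ~188 GB/s figure in the post matches a capacity-weighted average, which the last helper reproduces.

```python
def coverage_by_top_fraction(counts, fractions):
    """Share of all activations handled by the most-used fraction of experts.

    counts: mapping of expert id -> number of times it was activated
            (hypothetical format, e.g. gathered from router debug logging).
    """
    ranked = sorted(counts.values(), reverse=True)
    total = sum(ranked)
    result = {}
    for f in fractions:
        k = max(1, round(f * len(ranked)))  # always keep at least one expert
        result[f] = sum(ranked[:k]) / total
    return result


def effective_bandwidth(hit_rate, fast_gbps, slow_gbps):
    """Average GB/s for expert reads when hit_rate of them come from the fast tier.

    Time per byte is hit_rate/fast + (1 - hit_rate)/slow, so the slow tier
    dominates quickly. Assumes equal-sized experts and no overlap between tiers.
    """
    return 1.0 / (hit_rate / fast_gbps + (1.0 - hit_rate) / slow_gbps)


def capacity_weighted_bandwidth(tiers):
    """Capacity-weighted mean bandwidth over (capacity_gb, gbps) tiers."""
    total_capacity = sum(capacity for capacity, _ in tiers)
    return sum(capacity * gbps for capacity, gbps in tiers) / total_capacity


# Figures from the post: 128 GB DDR5 at ~48 GB/s, 24 GB GDDR6X at ~936 GB/s.
combined = capacity_weighted_bandwidth([(128, 48), (24, 936)])  # ~188.2
# If 85% of expert activations hit the GPU-resident hot set:
cached = effective_bandwidth(0.85, fast_gbps=936, slow_gbps=48)  # ~248
print(f"capacity-weighted: {combined:.1f} GB/s, with 85% hit rate: {cached:.1f} GB/s")
```

Under these assumptions, the remaining 15% of activations served from system RAM still account for most of the read time, which is why the estimate lands around 248 GB/s rather than near the GPU's 936 GB/s. The `coverage_by_top_fraction` helper is one way to get a table like the one in the post out of raw per-expert activation counts.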


Similar Articles

What is the point of MoE models, beyond being faster?

@jun_song: If we ever figure out how to load ONLY the active params of an MoE into the GPU instead of the full weights, it's game …

Mixture of Experts (MoEs) in Transformers

MobileMoE: Scaling On-Device Mixture of Experts

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)
