Tag
Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.