Tag
This paper proposes φ-balancing, a principled framework for load balancing in Mixture-of-Experts models that directly targets population-level expert balance using convex duality and mirror descent, achieving more stable expert utilization and outperforming prior methods on reasoning and code generation benchmarks.
A practical blueprint for designing a backend system capable of handling 1 million concurrent users, covering architecture decisions like language selection, load balancing, database sharding, multi-layer caching, and resilience patterns.
MACS is a training-free inference framework that mitigates the straggler effect in expert parallelism for MoE multimodal large language models (MLLMs) by introducing entropy-weighted load and dynamic modality-adaptive capacity mechanisms.