Federation of Experts: Communication Efficient Distributed Inference for Large Language Models
Summary
Federation of Experts (FoE) restructures mixture-of-experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks. On LongBench it improves inference throughput and reduces end-to-end forward-pass latency by up to 5.2x while maintaining generation quality.
Cached at: 05/15/26, 12:21 AM
Source: https://huggingface.co/papers/2605.06206
Abstract
Federation of Experts restructures mixture of experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks while maintaining generation quality.
Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. An implementation of FoE finds that on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, TTFT by 3.62x, and TBT by 1.95x. It does so while achieving comparable generation quality to a mixture of experts model of the same size and training configuration.
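To make the communication argument in the abstract concrete, the sketch below counts where expert dispatches land when routing is restricted to a cluster of GPUs. It is a toy model and not the paper's implementation: the function name `dispatch_fractions`, the uniform-routing assumption, and the layout of clusters on consecutive GPUs are all hypothetical. It models only the all-to-all dispatch, not the sum that synchronizes residuals between clusters.

```python
from fractions import Fraction


def dispatch_fractions(num_nodes, gpus_per_node, cluster_gpus):
    """Return (same_gpu, intra_node, inter_node) fractions of expert dispatches.

    Toy assumptions (not from the paper): each token is routed uniformly at
    random to an expert hosted on one of the `cluster_gpus` GPUs in its own
    cluster, and clusters occupy consecutive GPU indices.
    """
    total_gpus = num_nodes * gpus_per_node
    if total_gpus % cluster_gpus != 0:
        raise ValueError("cluster_gpus must divide the total GPU count")

    same_gpu = intra_node = inter_node = 0
    for src in range(total_gpus):
        # First GPU index of the cluster that contains the source GPU.
        cluster_start = (src // cluster_gpus) * cluster_gpus
        for dst in range(cluster_start, cluster_start + cluster_gpus):
            if dst == src:
                same_gpu += 1      # no network traffic at all
            elif dst // gpus_per_node == src // gpus_per_node:
                intra_node += 1    # stays on the intra-node fabric
            else:
                inter_node += 1    # crosses the inter-node network

    total = total_gpus * cluster_gpus
    return (
        Fraction(same_gpu, total),
        Fraction(intra_node, total),
        Fraction(inter_node, total),
    )


# Baseline: expert parallelism across all 8 GPUs of a 2-node, 4-GPU-per-node job.
baseline = dispatch_fractions(num_nodes=2, gpus_per_node=4, cluster_gpus=8)

# Multi-node sketch: one cluster per node, so dispatch never leaves the node.
per_node = dispatch_fractions(num_nodes=2, gpus_per_node=4, cluster_gpus=4)

# Single-node sketch: one cluster per GPU, so no all-to-all is needed.
per_gpu = dispatch_fractions(num_nodes=1, gpus_per_node=4, cluster_gpus=1)

print("baseline:", baseline)
print("cluster per node:", per_node)
print("cluster per GPU:", per_gpu)
```

Under these assumptions, half of the baseline dispatches cross the inter-node network, while confining each cluster to a node or to a single GPU drives that fraction to zero. Real routing is not uniform, so the numbers illustrate the structure of the saving and not its measured size.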
Similar Articles
Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs
Mix-MoE proposes a mixed Mixture-of-Experts framework with specialized expert groups and Fourier-transform-enhanced routing to mitigate parameter interference in multilingual machine translation, achieving significant improvements over baselines.
XPERT: Expert Knowledge Transfer for Effective Training of Language Models
The paper introduces XPERT, a framework that extracts and reuses expert knowledge from pre-trained Mixture-of-Experts (MoE) language models to improve training efficiency and performance in downstream models.
Less is MoE: Trimming Experts in Domain-Specialist Language Models
This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.
@rohanpaul_ai: A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of e…
A new method called Zero-Expert Self-Distillation Adaptation (ZEDA) allows MoE models like Qwen3 and GLM to skip half their expert computations on easy tokens with minimal accuracy loss, achieving ~20% inference speedup by adding dummy experts that output nothing.
Mixture of Experts (MoEs) in Transformers
Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.