Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

Hugging Face Daily Papers

Summary

Federation of Experts (FoE) restructures mixture-of-experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks and improving inference throughput and latency by up to 5.2x while maintaining generation quality.

Mixture of experts (MoE) has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads, and expert parallelism is applied among the experts within that cluster. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication, as all experts within a cluster are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. On LongBench, an implementation of FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing end-to-end forward-pass latency by up to 5.2x, time to first token (TTFT) by 3.62x, and time between tokens (TBT) by 1.95x. It does so while achieving generation quality comparable to that of a mixture-of-experts model of the same size and training configuration.
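
To make the communication argument concrete, here is a minimal counting sketch in Python. It is not the paper's implementation. It assumes uniform routing (every GPU dispatches an equal share of tokens to every GPU it is allowed to reach), assumes one FoE cluster per node in the multi-node case, and ignores the cross-cluster sum that synchronizes the residuals. The function name `cross_node_fraction` and its parameters are hypothetical, chosen only for illustration.

```python
from itertools import product


def cross_node_fraction(num_nodes: int, gpus_per_node: int, confined: bool) -> float:
    """Fraction of token-dispatch traffic that crosses a node boundary.

    Assumption (not from the paper): routing is uniform, so each
    (source GPU, destination GPU) pair carries the same number of tokens.
    Pairs with source == destination are local and never cross a node.

    confined=False models expert parallelism spanning every GPU in the job.
    confined=True models dispatch restricted to GPUs on the source's own
    node, which is how this sketch stands in for a cluster-confined layout.
    """
    gpus = [(node, gpu) for node in range(num_nodes) for gpu in range(gpus_per_node)]
    total = 0
    crossing = 0
    for src, dst in product(gpus, repeat=2):
        same_node = src[0] == dst[0]
        if confined and not same_node:
            continue  # confined dispatch never targets another node
        total += 1
        if not same_node:
            crossing += 1
    return crossing / total


if __name__ == "__main__":
    for nodes in (1, 2, 4):
        spanning = cross_node_fraction(nodes, 8, confined=False)
        confined = cross_node_fraction(nodes, 8, confined=True)
        print(f"nodes={nodes}: spanning={spanning:.2f}, confined={confined:.2f}")
```

Under these assumptions, dispatch that spans N nodes sends a fraction 1 - 1/N of its traffic over the inter-node network, while confined dispatch sends none. Real routing is rarely uniform, so the figures only show the direction of the effect, not the magnitudes reported in the abstract.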

Cached at: 05/15/26, 12:21 AM

Source: https://huggingface.co/papers/2605.06206

Similar Articles

Less is MoE: Trimming Experts in Domain-Specialist Language Models

arXiv cs.LG

This paper introduces Fisher-MoE, a method that compresses Mixture-of-Experts models by trimming intermediate dimensions within FFN layers using Fisher importance, achieving 45% weight memory reduction and 21% throughput improvement without significant capability loss.

Mixture of Experts (MoEs) in Transformers

Hugging Face Blog

Hugging Face blog post explaining Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.