Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

Summary

Federation of Experts (FoE) restructures mixture-of-experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks and improving inference throughput and latency by up to 5.2x while maintaining generation quality.

Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. An implementation of FoE finds that on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, TTFT by 3.62x, and TBT by 1.95x. It does so while achieving comparable generation quality to a mixture of experts model of the same size and training configuration.
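The communication claim above can be illustrated with a toy dispatch counter. This is a minimal sketch, not the paper's implementation. It assumes uniform random top-k routing, a hypothetical layout in which each GPU hosts `experts_per_gpu` consecutive experts, and that every FoE cluster holds a full copy of the summed residual and routes each token only among its own experts. The function names and parameters are invented for illustration, and the cost of the cross-cluster sum is not modelled.

```python
import random
from collections import Counter


def classify(src_gpu, dst_gpu, gpus_per_node):
    """Label one token dispatch by the fabric it has to cross."""
    if src_gpu == dst_gpu:
        return "local"
    if src_gpu // gpus_per_node == dst_gpu // gpus_per_node:
        return "intra_node"
    return "inter_node"


def baseline_dispatch(num_tokens, num_nodes, gpus_per_node,
                      experts_per_gpu, top_k, rng):
    """Conventional expert parallelism: experts are spread over every GPU,
    so a token may be routed to any GPU in the job."""
    num_gpus = num_nodes * gpus_per_node
    num_experts = num_gpus * experts_per_gpu
    counts = Counter()
    for t in range(num_tokens):
        src = t % num_gpus  # assumed home GPU of the token
        for e in rng.sample(range(num_experts), top_k):
            counts[classify(src, e // experts_per_gpu, gpus_per_node)] += 1
    return counts


def foe_dispatch(num_tokens, num_nodes, gpus_per_node, gpus_per_cluster,
                 experts_per_gpu, top_k, rng):
    """FoE-style placement (assumed): each cluster owns a contiguous group of
    GPUs inside one node, and tokens are routed only within the cluster."""
    if gpus_per_node % gpus_per_cluster != 0:
        raise ValueError("clusters must not straddle nodes")
    num_gpus = num_nodes * gpus_per_node
    num_clusters = num_gpus // gpus_per_cluster
    experts_in_cluster = gpus_per_cluster * experts_per_gpu
    counts = Counter()
    for c in range(num_clusters):
        first_gpu = c * gpus_per_cluster
        first_expert = first_gpu * experts_per_gpu
        candidates = range(first_expert, first_expert + experts_in_cluster)
        for t in range(num_tokens):
            # Every cluster sees every token; the token sits on a cluster GPU.
            src = first_gpu + t % gpus_per_cluster
            for e in rng.sample(candidates, top_k):
                counts[classify(src, e // experts_per_gpu, gpus_per_node)] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    print("baseline, 2 nodes:", dict(baseline_dispatch(256, 2, 4, 4, 2, rng)))
    print("FoE, 1 node:      ", dict(foe_dispatch(256, 1, 4, 1, 4, 2, rng)))
    print("FoE, 2 nodes:     ", dict(foe_dispatch(256, 2, 4, 4, 4, 2, rng)))
```

With one GPU per cluster, every dispatch in this model is local, matching the single-node claim. With one cluster per node, dispatches are local or intra-node only, matching the multi-node claim. Under these assumptions, the only traffic that would cross clusters is the residual sum, which the sketch leaves out.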

Original Article


Cached at: 05/15/26, 12:21 AM


Source: https://huggingface.co/papers/2605.06206

Abstract

Federation of Experts restructures mixture of experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks while maintaining generation quality.

Mixture of experts has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. An implementation of FoE finds that on LongBench, FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing the end-to-end forward-pass latency by up to 5.2x, TTFT by 3.62x, and TBT by 1.95x. It does so while achieving comparable generation quality to a mixture of experts model of the same size and training configuration.


Get this paper in your agent:

hf papers read 2605.06206

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Mix-MoE: Improving Multilingual Machine Translation of Large Language Models through Mixed MoEs

XPERT: Expert Knowledge Transfer for Effective Training of Language Models

Less is MoE: Trimming Experts in Domain-Specialist Language Models

@rohanpaul_ai: A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of e…

Mixture of Experts (MoEs) in Transformers
