DOT-MoE: Differentiable Optimal Transport for MoEfication

Hugging Face Daily Papers 06/01/26, 12:00 AM Papers

mixture-of-experts optimal-transport differentiable model-compression efficiency sparse-models

Summary

DOT-MoE formulates dense layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models that retain 90% of original performance while reducing active parameters by 50%.

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, using differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
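To make the balanced-transport idea concrete, the following is a minimal sketch of Sinkhorn-Knopp normalization applied to a neuron-to-expert score matrix. It is not the paper's implementation: the function name `sinkhorn_assign`, the `temperature` parameter, the equal-capacity choice (N / E neurons per expert), and the toy scores are all assumptions made for illustration, and the paper's actual cost function is not given in the abstract.

```python
import math


def sinkhorn_assign(scores, n_iters=200, temperature=0.5):
    """Return a soft neuron-to-expert plan from an N x E score matrix.

    Assumption (not from the paper): every expert has the same capacity
    N / E, and a higher score means a stronger neuron-expert affinity.
    """
    n, e = len(scores), len(scores[0])
    capacity = n / e
    # Gibbs kernel; subtracting the global max keeps exp() well-behaved.
    top = max(max(row) for row in scores)
    plan = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: each neuron distributes exactly one unit of mass.
        for row in plan:
            total = sum(row)
            for j in range(e):
                row[j] /= total
        # Column step: each expert receives exactly `capacity` units.
        for j in range(e):
            total = sum(plan[i][j] for i in range(n))
            for i in range(n):
                plan[i][j] *= capacity / total
    return plan


# Toy example: three of four neurons prefer expert 0, but capacity is 2 each.
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
toy_plan = sinkhorn_assign(toy_scores)
```

Every step is a multiplication or division, so in an autodiff framework the whole loop is differentiable with respect to the scores. The plan is still soft at this point; turning it into a discrete assignment is where a straight-through estimator would come in.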

Original Article

Cached at: 06/02/26, 07:33 PM

Paper page - DOT-MoE: Differentiable Optimal Transport for MoEfication

Source: https://huggingface.co/papers/2606.01666

Abstract

DOT-MoE formulates dense layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models with improved performance retention.

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, using differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
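The straight-through estimator mentioned in the abstract can be illustrated with a small hand-written forward and backward pass. This sketch assumes the common "hard forward, soft backward" variant over a softmax; the abstract does not say which STE variant DOT-MoE uses, and the names `ste_forward` and `ste_backward` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]


def ste_forward(logits):
    """Forward pass: emit a hard one-hot choice, keep the soft probabilities."""
    soft = softmax(logits)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if j == k else 0.0 for j in range(len(soft))]
    return hard, soft


def ste_backward(soft, grad_hard):
    """Backward pass: pretend the output was `soft` instead of `hard`.

    Assumption: gradients flow through the softmax Jacobian,
    d soft_i / d logit_j = soft_i * (delta_ij - soft_j).
    """
    dot = sum(g * s for g, s in zip(grad_hard, soft))
    return [s * (g - dot) for s, g in zip(soft, grad_hard)]


# One neuron choosing among three experts (toy logits, not from the paper).
hard, soft = ste_forward([2.0, 1.0, 0.1])
# A made-up upstream gradient with respect to the hard assignment.
grad_logits = ste_backward(soft, [-1.0, 0.5, 0.0])
```

The forward pass commits to a single expert, so the converted model stays sparse, while the backward pass still sends a nonzero gradient to every logit. That is what allows a discrete assignment and a routing policy to be trained with gradient descent.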

Similar Articles

MobileMoE: Scaling On-Device Mixture of Experts

What is the point of MoE models, beyond being faster?

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning

Post-Trained MoE Can Skip Half Experts via Self-Distillation
