DOT-MoE: Differentiable Optimal Transport for MoEfication
Summary
DOT-MoE formulates dense layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models that retain 90% of original performance while reducing active parameters by 50%.
Source: https://huggingface.co/papers/2606.01666
Abstract
DOT-MoE formulates dense layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models with improved performance retention.
The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model’s performance while reducing active parameters by 50%.
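To make the balanced-transport idea in the abstract concrete, here is a minimal sketch of Sinkhorn-Knopp iterations for assigning FFN neurons to equally sized experts. It is an illustration under stated assumptions, not the paper's implementation: the score matrix, the temperature `epsilon`, the iteration count, the function names, and the greedy rounding step are all hypothetical choices made for this example. It shows only the forward computation; in an autodiff framework, a straight-through estimator would use the hard assignment in the forward pass and pass gradients through the soft plan.

```python
import math
import random


def sinkhorn_plan(scores, epsilon=0.5, n_iters=200):
    """Soft neuron-to-expert plan: each row sums to 1 and each column
    sums (approximately) to n_neurons / n_experts, the expert capacity.

    `scores`, `epsilon`, and `n_iters` are illustrative assumptions.
    """
    n, e = len(scores), len(scores[0])
    capacity = n / e
    # Gibbs kernel exp(score / epsilon); subtracting the row max only
    # rescales the row, which the row normalization absorbs.
    plan = [[math.exp((s - max(row)) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: scale every expert column to the shared capacity.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(e)]
        plan = [[plan[i][j] * capacity / col_sums[j] for j in range(e)]
                for i in range(n)]
        # Row step: every neuron distributes exactly one unit of mass.
        plan = [[p / sum(row) for p in row] for row in plan]
    return plan


def round_to_experts(plan):
    """Greedy hard assignment (an assumed rounding rule): visit
    (neuron, expert) pairs by decreasing plan mass and respect capacity."""
    n, e = len(plan), len(plan[0])
    capacity = n // e  # assumes n is divisible by e
    pairs = sorted(((plan[i][j], i, j) for i in range(n) for j in range(e)),
                   reverse=True)
    assignment = [None] * n
    load = [0] * e
    for _, i, j in pairs:
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment


# Toy example: 8 neurons, 2 experts, random affinity scores.
rng = random.Random(0)
scores = [[rng.random() for _ in range(2)] for _ in range(8)]
plan = sinkhorn_plan(scores)
assignment = round_to_experts(plan)
print(assignment)
```

Because the column marginals are fixed to the same value, no expert can absorb more than its share of neurons, which is the capacity constraint the abstract refers to. A smaller `epsilon` makes the soft plan closer to a hard assignment but slows convergence of the iterations.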
Similar Articles
MobileMoE: Scaling On-Device Mixture of Experts
MobileMoE introduces efficient on-device mixture-of-experts language models with sub-billion parameters, achieving better performance and efficiency than dense baselines and existing MoE models. The models are trained on open-source datasets and demonstrate significant speedups on commodity smartphones.
What is the point of MoE models, beyond being faster?
A discussion about the advantages of Mixture of Experts (MoE) models over dense models beyond speed, considering RAM constraints and scaling limits.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
This paper introduces DisagMoE, a system for MoE training that optimizes computation-communication overlap by disaggregating attention and FFN layers across GPU groups. Implemented on Megatron-LM, it achieves up to 1.8x speedup on H800 clusters by addressing inter-node communication bottlenecks.
An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning
This paper introduces MMOT, an online mixture model learning framework based on optimal transport theory that addresses incremental learning with distributional shifts through dynamic centroid updates and improved class similarity estimation. The approach includes a Dynamic Preservation strategy to mitigate catastrophic forgetting and maintain class separability in latent space.
Post-Trained MoE Can Skip Half Experts via Self-Distillation
ZEDA is a low-cost framework that converts post-trained static MoE models into dynamic ones by injecting zero-output experts and using self-distillation, achieving over 50% expert FLOP reduction with marginal accuracy loss on benchmarks.