sparse-models

DOT-MoE: Differentiable Optimal Transport for MoEfication

Hugging Face Daily Papers · 2026-06-01

DOT-MoE formulates dense layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models that retain 90% of original performance while reducing active parameters by 50%.

EMO: Pretraining Mixture of Experts for Emergent Modularity

Hugging Face Daily Papers · 2026-05-07

EMO is a Mixture-of-Experts model that enables modular deployment by grouping tokens from similar domains under shared experts. It achieves performance comparable to standard MoEs while allowing significant expert pruning: 25% of the experts retain 99% of performance.

Mixture of Experts (MoEs) in Transformers

Hugging Face Blog · 2026-02-26

A Hugging Face blog post explaining the Mixture of Experts (MoE) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.
