Tag
DOT-MoE formulates dense-layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models that retain 90% of the original model's performance while reducing active parameters by 50%.
EMO is a Mixture-of-Experts model that enables modular deployment by grouping tokens from similar domains with shared experts, achieving performance comparable to standard MoEs while allowing significant expert pruning (25% of experts retain 99% of performance).
Hugging Face blog post explaining the Mixture of Experts (MoE) architecture in Transformers, covering the shift from dense to sparse models, weight loading optimizations, expert parallelism, and training techniques for MoE-based language models.