@Jianlin_S: MoE (9): The Gate Normalization Debate https://kexue.fm/archives/11782
Summary
Part 9 of a blog series on Mixture of Experts (MoE) models, discussing the debate over gate normalization.
Similar Articles
Mixture of Experts (MoEs) in Transformers
Hugging Face blog post explaining the Mixture of Experts (MoE) architecture in Transformers, covering the shift from dense to sparse models, weight-loading optimizations, expert parallelism, and training techniques for MoE-based language models.
What is the point of MoE models, beyond being faster?
A discussion about the advantages of Mixture of Experts (MoE) models over dense models beyond speed, considering RAM constraints and scaling limits.
@jbhuang0604: Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’v…
The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.
Emergent Modularity in Mixture-of-Experts Models (8 minute read)
Ai2 releases EMO, a 14B-parameter mixture-of-experts language model trained to develop emergent modularity. It allows using a small subset of experts for specific tasks while maintaining near full-model performance.
Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap
This paper introduces a Jacobian-PCA-Grassmann framework to analyze the geometric structure of expert specialization in Mixture-of-Experts (MoE) Transformers. It finds that experts exhibit strong functional decorrelation while their representations overlap, and that routing sparsity significantly influences this geometry.