@Jianlin_S: MoE (9): The Gate Normalization Debate

X AI KOLs Timeline 2026/06/17 08:34 Paper

Summary

A blog post on the debate over gate normalization in Mixture of Experts (MoE) models.

MoE (9): The Gate Normalization Debate https://kexue.fm/archives/11782


Similar articles

Mixture of Experts (MoEs) in Transformers

Hugging Face Blog

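A Hugging Face blog post introducing the Mixture of Experts (MoEs) architecture in Transformers, covering the shift from dense to sparse models, weight-loading optimizations, expert parallelism, and techniques for training MoE-based language models.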

Beyond being faster, what is the point of MoE models?

Reddit r/LocalLLaMA

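A discussion of the advantages MoE (Mixture of Experts) models have over dense models beyond speed, taking memory constraints and scaling limits into account.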

@jbhuang0604: Huge! It’s amazing how often Noam’s papers end up at the center of the field. In many tutorial videos I’ve made, they’v…

X AI KOLs Following

The article provides a detailed explanation of Mixture of Experts (MoE) in transformers, covering routing, load balancing, and recent innovations like fine-grained experts. It also highlights the significance of Noam Shazeer's research contributions and his move from Google to OpenAI.

Emergent Modularity in Mixture of Experts Models (8 minute read)

TLDR AI

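Ai2 has released EMO, a 14B-parameter Mixture of Experts language model trained to develop emergent modularity. It allows a small subset of experts to be used for a specific task while retaining close to full-model performance.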

Geometric Asymmetry in MoE Specialization: Functional Decorrelation and Representational Overlap

arXiv cs.LG

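This paper proposes a Jacobian-PCA-Grassmann framework for analyzing the geometric structure of expert specialization in Mixture of Experts (MoE) Transformers. It finds that experts exhibit strong functional decorrelation while their representations overlap, and that routing sparsity significantly affects this geometry.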