@maximelabonne: To clarify, this paper basically says: under AdamW, µP's embedding LR rule (constant) is essentially right and explains…

X AI KOLs Following 05/22/26, 10:00 AM Papers

Summary

This paper argues that under AdamW, µP's embedding learning rate rule (keep it constant as width grows) is essentially correct and accounts for most of µP's benefit. This appears to contradict an earlier finding by Hayou and Liu that the rule breaks down at realistic LLM vocab sizes.

Original Article

Cached at: 05/23/26, 02:09 PM

Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width

I’m no nanoGPT speedrunner, but isn’t it something people stumbled into by using Muon for hidden layers + Adam for the rest?

To clarify, this paper basically says: under AdamW, µP’s embedding LR rule (constant) is essentially right and explains most of µP’s benefit.

Last year, Hayou et al. found that µP’s embedding LR rule is wrong for realistic LLM vocab sizes. They found that the optimal embedding LR decreases as 1/√width.

These predictions look contradictory. But this paper successfully tested its thesis in a regime (vocab=50k, width=128-2048) that shouldn’t work according to Hayou.
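To make the disagreement concrete, here is a rough sketch of the two embedding LR rules as described above, evaluated over the width range mentioned (128 to 2048). The function name, `base_lr`, and `base_width` are made up for illustration and don't come from either paper; the 1/√width rule is anchored so both rules agree at the base width.

```python
import math


def embedding_lr(width, base_lr=1e-2, base_width=128, rule="mup"):
    """Return an embedding learning rate for a given model width.

    base_lr and base_width are illustrative assumptions, not values
    taken from either paper.
    """
    if rule == "mup":
        # µP under AdamW, as described in the post: constant in width.
        return base_lr
    if rule == "inv_sqrt":
        # Scaling attributed to Hayou in the post: LR ~ 1/sqrt(width),
        # normalized to match base_lr at base_width.
        return base_lr * math.sqrt(base_width / width)
    raise ValueError(f"unknown rule: {rule}")


for width in [128, 256, 512, 1024, 2048]:
    constant = embedding_lr(width, rule="mup")
    decayed = embedding_lr(width, rule="inv_sqrt")
    print(f"width={width:5d}  constant={constant:.5f}  inv_sqrt={decayed:.5f}")
```

With these assumed values the two rules differ by a factor of 4 at width 2048, so a sweep over this range should in principle be able to tell them apart.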

Not sure why this is the case tbh, but interesting future work to explore!

https://x.com/Ham_TheFog/status/2057617101360451886…

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Barkeshli): https://arxiv.org/abs/2605.21486

Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size (Hayou and Liu): https://arxiv.org/abs/2506.15025
