@maximelabonne: To clarify, this paper basically says: under AdamW, µP's embedding LR rule (constant) is essentially right and explains…


Summary

This paper clarifies that under AdamW, µP's embedding learning rate rule (constant) is essentially correct and explains most of µP's benefit, contrary to a previous finding by Hayou et al. about realistic LLM vocab sizes.


Cached at: 05/23/26, 02:09 PM

Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width

I’m no nanoGPT speedrunner, but isn’t it something people stumbled into by using Muon for hidden layers + Adam for the rest?

To clarify, this paper basically says: under AdamW, µP’s embedding LR rule (constant) is essentially right and explains most of µP’s benefit.

Last year, Hayou et al. found that µP’s embedding LR rule is wrong for realistic LLM vocab sizes. They found that the optimal embedding LR decreases as 1/√width.

These predictions look contradictory. But this paper successfully tested its thesis in a regime (vocab=50k, width=128-2048) where, according to Hayou et al., it shouldn’t hold.
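
To get a feel for how far apart the two predictions are over that width range, here is a minimal sketch. It only encodes the two scaling rules as described above (constant vs. 1/√width); the function name, the base LR and the base width are hypothetical placeholders, not values from either paper.

```python
import math


def embedding_lr(base_lr: float, base_width: int, width: int, rule: str) -> float:
    """Transfer an embedding LR tuned at base_width to a new width."""
    if rule == "constant":
        # µP's embedding rule under AdamW, as described in the post
        return base_lr
    if rule == "inv_sqrt_width":
        # 1/sqrt(width) trend attributed to Hayou et al. in the post
        return base_lr * math.sqrt(base_width / width)
    raise ValueError(f"unknown rule: {rule}")


# Hypothetical starting point, chosen only for illustration
BASE_LR, BASE_WIDTH = 1e-2, 128

for width in (128, 256, 512, 1024, 2048):
    const = embedding_lr(BASE_LR, BASE_WIDTH, width, "constant")
    inv_sqrt = embedding_lr(BASE_LR, BASE_WIDTH, width, "inv_sqrt_width")
    print(f"width={width:5d}  constant={const:.5f}  inv_sqrt={inv_sqrt:.5f}  "
          f"gap={const / inv_sqrt:.1f}x")
```

Going from width 128 to 2048 is a 16x increase, so the two rules end up a factor of √16 = 4 apart at the top of the range. That is a large enough gap that an LR sweep should be able to tell them apart.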

Not sure why this is the case tbh, but interesting future work to explore!

https://x.com/Ham_TheFog/status/2057617101360451886…

Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Barkeshli): https://arxiv.org/abs/2605.21486

Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size (Hayou and Liu): https://arxiv.org/abs/2506.15025
