@maximelabonne: To clarify, this paper basically says: under AdamW, µP's embedding LR rule (constant) is essentially right and explains…
Summary
This paper argues that under AdamW, µP's embedding learning rate rule (constant in width) is essentially correct and explains most of µP's benefit. This appears to contradict an earlier finding by Hayou et al. that the rule breaks down at realistic LLM vocab sizes.
Cached at: 05/23/26, 02:09 PM
Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width
I’m no nanoGPT speedrunner, but isn’t it something people stumbled into by using Muon for hidden layers + Adam for the rest?
To clarify, this paper basically says: under AdamW, µP’s embedding LR rule (constant) is essentially right and explains most of µP’s benefit.
Last year, Hayou et al. found that µP’s embedding LR rule is wrong for realistic LLM vocab sizes. They found that the optimal embedding LR decreases as 1/√width.
These predictions look contradictory, yet this paper successfully tested its thesis in a regime (vocab=50k, width=128-2048) where µP’s rule shouldn’t work according to Hayou et al.
Not sure why this is the case tbh, but interesting future work to explore!
https://x.com/Ham_TheFog/status/2057617101360451886…
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Barkeshli): https://arxiv.org/abs/2605.21486
Optimal Embedding Learning Rate in LLMs: The Effect of Vocabulary Size (Hayou and Liu): https://arxiv.org/abs/2506.15025
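For a rough feel of how far apart the two embedding LR rules in this thread end up, here is a minimal sketch, not code from either paper. It assumes the hidden-layer LR shrinks as 1/width (the usual µP rule under Adam-style optimizers). The function name `layer_lrs`, the base width of 128, and the base LR of 1e-3 are all made up for illustration; only the width range 128-2048 comes from the thread.

```python
import math


def layer_lrs(width, base_width=128, base_lr=1e-3, embedding_rule="constant"):
    """Return illustrative per-layer learning rates at a given model width.

    Hypothetical helper: base_width and base_lr are assumed values, standing
    in for an LR tuned on a small proxy model.
    """
    m = width / base_width  # width multiplier relative to the proxy model

    # Assumption: hidden-layer LR shrinks as 1/width under AdamW.
    hidden_lr = base_lr / m

    if embedding_rule == "constant":
        # µP's rule as described in the thread: embedding LR does not change with width.
        embedding_lr = base_lr
    elif embedding_rule == "inv_sqrt":
        # Hayou et al.'s finding as described in the thread: LR decreases as 1/sqrt(width).
        embedding_lr = base_lr / math.sqrt(m)
    elif embedding_rule == "same_as_hidden":
        # Baseline where the embedding layer just shares the hidden-layer LR.
        embedding_lr = hidden_lr
    else:
        raise ValueError(f"unknown embedding_rule: {embedding_rule!r}")

    return {"embedding": embedding_lr, "hidden": hidden_lr}


if __name__ == "__main__":
    for width in (128, 512, 2048):
        for rule in ("constant", "inv_sqrt", "same_as_hidden"):
            lrs = layer_lrs(width, embedding_rule=rule)
            print(f"width={width:5d} rule={rule:15s} "
                  f"embedding={lrs['embedding']:.2e} hidden={lrs['hidden']:.2e}")
```

With these assumed values, going from width 128 to 2048 is a 16x width multiplier: the constant rule leaves the embedding LR at its base value, while the 1/√width rule divides it by 4. In an actual training run these numbers would go into separate optimizer parameter groups for the embedding and hidden weights.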