mu-parameterization

@maximelabonne: Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Ba…

X AI KOLs Following ↗ · 2026-05-22 Cached

This paper introduces a framework for quantifying hyperparameter transfer in LLMs. It finds that the benefit of maximal update parameterization (μP) over standard parameterization (SP) in AdamW training comes largely from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.
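
To make the embedding learning rate finding concrete, here is a minimal sketch of giving embedding parameters their own, larger learning rate through optimizer parameter groups. It is not the paper's code: the helper `build_param_groups`, the `"embed"` name match, and the multiplier of 10 are all hypothetical choices for illustration, and plain strings stand in for tensors so the example runs without any deep learning library.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float,
    embed_lr_mult: float,
    embed_key: str = "embed",  # assumption: embedding parameter names contain this substring
) -> List[Dict[str, Any]]:
    """Split parameters into a default group and an embedding group with a scaled learning rate."""
    embed, other = [], []
    for name, param in named_params:
        # Route each parameter by name; everything that is not an embedding keeps base_lr.
        (embed if embed_key in name else other).append(param)
    return [
        {"params": other, "lr": base_lr},
        {"params": embed, "lr": base_lr * embed_lr_mult},
    ]


# Toy stand-ins for (name, parameter) pairs; strings replace tensors here.
toy_params = [
    ("embed_tokens.weight", "E"),
    ("layers.0.attn.weight", "W_attn"),
    ("layers.0.mlp.weight", "W_mlp"),
]

# Hypothetical values: the multiplier is illustrative, not a recommendation from the paper.
groups = build_param_groups(toy_params, base_lr=1e-3, embed_lr_mult=10.0)
```

The returned list has the per-group dictionary shape that PyTorch optimizers such as `torch.optim.AdamW` accept, so with a real model the pairs would come from `model.named_parameters()` instead of the toy list.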
