Tag
This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that the benefit of μP over SP in AdamW training largely comes from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.