Tag
This paper clarifies that under AdamW, µP's embedding learning rate rule (constant) is essentially correct and explains most of µP's benefit, contrary to a previous finding by Hayou et al. about realistic LLM vocab sizes.