Tag
This paper clarifies that, under AdamW, µP's embedding learning-rate rule (held constant) is essentially correct and accounts for most of µP's benefit, contrary to an earlier finding by Hayou et al. concerning realistic LLM vocabulary sizes.
This paper presents the first quantitative prediction of the grokking delay under AdamW, deriving a closed-form law and validating it on algorithmic tasks with high accuracy.
This paper introduces LBW-Guard, an autonomous training-control governance layer that sits above the AdamW optimizer, monitors training telemetry, and applies bounded control during training, demonstrating improved perplexity and training speed under stress conditions.