adamw

#adamw

@maximelabonne: To clarify, this paper basically says: under AdamW, µP's embedding LR rule (constant) is essentially right and explains…

X AI KOLs Following · 2026-05-22

This paper clarifies that, under AdamW, µP's embedding learning-rate rule (constant) is essentially correct and accounts for most of µP's benefit, contrary to an earlier finding by Hayou et al. concerning realistic LLM vocabulary sizes.

#adamw

First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

arXiv cs.LG · 2026-05-20

This paper presents the first quantitative prediction of the grokking delay under AdamW, deriving a closed-form law and validating it on algorithmic tasks with high accuracy.

#adamw

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

arXiv cs.AI · 2026-05-20

This paper introduces LBW-Guard, a bounded autonomous training control governance layer that operates above the AdamW optimizer to monitor telemetry and apply bounded control during training, demonstrating improved perplexity and training speed under stress conditions.
