adamw

#adamw

Hidden Boundary Motion in Transformer Optimization: Function-Space Orthogonalization of Affine Weight and Bias Updates

arXiv cs.LG · 7h ago

The paper identifies hidden boundary motion in the affine layers of Transformers, where a nonzero input mean makes weight updates act as bias updates. It proposes SBO-AdamW, which orthogonalizes the shape and boundary components of the update, and reports improved validation accuracy.

@burny_tech: some updates on the optimizer magic

X AI KOLs Timeline · 3d ago

A new NVIDIA paper proposes higher-order optimizers such as Muon and SOAP as more efficient alternatives to AdamW for large-scale LLM pretraining.

Reassessing Muon for Matrix Factorization

arXiv cs.LG · 2026-07-16

This paper evaluates the Muon optimizer on low-rank matrix factorization and finds that it does not consistently outperform AdamW, which challenges earlier claims about its advantages in large-scale deep learning.

@maximelabonne: Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode…

X AI KOLs Following · 2026-06-10

Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 in decoding. Its effectiveness depends on the optimizer: it works with Muon but not with AdamW, which highlights the role of optimizer geometry.

@maximelabonne: To clarify, this paper basically says: under AdamW, µP's embedding LR rule (constant) is essentially right and explains…

X AI KOLs Following · 2026-05-22

The paper argues that, under AdamW, µP's constant embedding learning-rate rule is essentially correct and explains most of µP's benefit, contrary to an earlier finding by Hayou et al. for realistic LLM vocabulary sizes.

First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

arXiv cs.LG · 2026-05-20

This paper presents what it describes as the first quantitative prediction of the grokking delay under AdamW. It derives a closed-form law and validates it on algorithmic tasks, where the law is reported to hold with high accuracy.

Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

arXiv cs.AI · 2026-05-20

This paper introduces LBW-Guard, a governance layer for bounded autonomous training control that sits above the AdamW optimizer, monitors telemetry, and applies bounded control during training. The authors report improved perplexity and training speed under stress conditions.
