learning-rate

#learning-rate

@maximelabonne: Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate (first screenshot, Kalra and Ba…

X AI KOLs Following ↗ · 2026-05-22 Cached

This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that the benefit of μP (Maximal Update Parametrization) over SP (standard parametrization) in AdamW training largely comes from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.

#learning-rate

@maximelabonne: Turns out you never really needed µP, you just needed to scale the embedding learning rate by model width I'm no nanoGP…

X AI KOLs Following ↗ · 2026-05-21 Cached

A tweet suggests that scaling the embedding learning rate by model width can replace µP (Maximal Update Parametrization), referencing a setup that uses the Muon optimizer for hidden layers and Adam for the rest.
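As a rough illustration of what "scaling the embedding learning rate by model width" could look like in practice, here is a minimal sketch. The linear rule `width / base_width`, the function name `build_param_groups`, and the default `base_width=256` are all assumptions made for illustration; the tweet does not state the exact scaling rule or any reference width.

```python
def build_param_groups(base_lr, width, base_width=256):
    """Return optimizer-style parameter groups with a width-scaled embedding LR.

    Assumption (not from the source): the embedding learning rate grows
    linearly with model width relative to a hypothetical reference width.
    """
    if width <= 0 or base_width <= 0:
        raise ValueError("widths must be positive")
    scale = width / base_width  # assumed linear scaling rule
    return [
        # Embedding parameters get the scaled learning rate.
        {"name": "embedding", "lr": base_lr * scale},
        # Hidden layers keep the base learning rate in this sketch.
        {"name": "hidden", "lr": base_lr},
    ]


groups = build_param_groups(base_lr=1e-3, width=1024, base_width=256)
for group in groups:
    print(group["name"], group["lr"])
```

Dictionaries of this shape mirror the per-group `lr` override that most optimizers accept, so the same idea carries over once real parameter tensors are attached to each group.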

#learning-rate

Dropping learning rate fixed my QLoRA fine-tune more than anything else I tried

Reddit r/LocalLLaMA ↗ · 2026-05-14

A user found that reducing the learning rate from 2e-4 to 1e-4 significantly improved QLoRA fine-tuning of Llama 3.1 8B on a small dataset (8k samples), preventing overfitting and leading to better evaluation results.
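A comparison like the one in this post can be framed as a small learning-rate sweep. The sketch below is only illustrative: the two candidate values (2e-4 and 1e-4) come from the post, while `pick_learning_rate` and `toy_eval_loss` are hypothetical names, and the toy loss is a placeholder standing in for a real fine-tuning and evaluation run, not a measurement.

```python
import math


def pick_learning_rate(candidates, evaluate):
    """Evaluate each candidate learning rate and return the one with the lowest loss.

    `evaluate` is any callable mapping a learning rate to an evaluation loss;
    in a real setting it would run a fine-tune and score a held-out set.
    """
    results = {lr: evaluate(lr) for lr in candidates}
    best = min(results, key=results.get)
    return best, results


def toy_eval_loss(lr):
    # Placeholder only (assumption): a curve with its minimum at 1e-4.
    # Replace with a real train-and-evaluate function.
    return (math.log10(lr) + 4.0) ** 2


best_lr, losses = pick_learning_rate([2e-4, 1e-4], toy_eval_loss)
print("selected learning rate:", best_lr)
```

Selecting on held-out evaluation loss rather than training loss matters here, since the post attributes the improvement to reduced overfitting on a small dataset.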

#learning-rate

A Closed-Form Upper Bound for Admissible Learning-Rate Steps in Belief-Space Dynamics

arXiv cs.LG ↗ · 2026-05-11 Cached

This paper derives a closed-form upper bound for admissible learning-rate steps in belief-space dynamics using KL divergence and Bregman geometry, focusing on cross-entropy classification.

#learning-rate

A Closed-Form Upper Bound for Admissible Learning-Rate Steps in Belief-Space Dynamics

Hugging Face Daily Papers ↗ · 2026-05-07 Cached

This paper presents a closed-form upper bound for admissible learning-rate steps in belief-space dynamics, a theoretical optimization result derived for cross-entropy classification.
