Tag
This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that the benefit of μP over SP in AdamW training largely comes from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.
A tweet suggests that scaling the embedding learning rate by model width can replace the need for µP (micro-parameterization), referencing Muon optimizer for hidden layers and Adam for the rest.
A user found that reducing the learning rate from 2e-4 to 1e-4 significantly improved QLoRA fine-tuning of Llama 3.1 8B on a small dataset (8k samples), preventing overfitting and leading to better evaluation results.
This paper derives a closed-form upper bound for admissible learning-rate steps in belief-space dynamics using KL divergence and Bregman geometry, focusing on cross-entropy classification.
This paper presents a closed-form upper bound for admissible learning-rate steps in belief-space dynamics, providing a theoretical result for optimization in robotics or control.