Tag
This paper discovers predictable scaling laws for optimal hyperparameters (learning rate, batch size) in LLM continued pre-training, proposing a two-stage framework that reduces hyperparameter search overhead by up to 90% while maintaining performance.
Argues that too much effort is spent on making optimizers marginally faster, when the real need is for hyperparameter-free optimizers.