Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

arXiv cs.CL Papers

Summary

This paper discovers predictable scaling laws for optimal hyperparameters (learning rate, batch size) in LLM continued pre-training, proposing a two-stage framework that reduces hyperparameter search overhead by up to 90% while maintaining performance.

arXiv:2606.05610v1 Announce Type: new


# Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training
Source: [https://arxiv.org/html/2606.05610](https://arxiv.org/html/2606.05610)
Yongwei Zhou¹∗, Juncheng Diao¹,²∗, Junlin Shang³, Peiguang Li¹, Rongxiang Weng¹

¹MeiTuan, ²University of Chinese Academy of Sciences, ³Harbin Institute of Technology

ywzhouphd2018@gmail.com, diaojuncheng24@mails.ucas.ac.cn

###### Abstract

The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) Empirical Law Discovery, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) State-Aware Hyperparameter Prediction, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its equivalent pre-training compute, i.e., the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.

∗Equal contribution.

## 1 Introduction

Continued pre-training (CPT) is pivotal for adapting LLMs to domain-specific applications Gupta et al. ([2023](https://arxiv.org/html/2606.05610#bib.bib9)); Rozière et al. ([2023](https://arxiv.org/html/2606.05610#bib.bib33)); Azerbayev et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib34)), offering a compute-efficient alternative to training from scratch. During this adaptation, LLMs encode complex knowledge structures, making them highly sensitive to hyperparameter configurations Ke et al. ([2023](https://arxiv.org/html/2606.05610#bib.bib13)). Inappropriate learning rates ($LR$) and batch sizes ($B$) often lead to training instability or performance degradation Wen et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib21)); Luo et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib35)). However, current practices still rely on heuristics or exhaustive grid searches, which are computationally expensive and lack principled guidance Yang et al. ([2022](https://arxiv.org/html/2606.05610#bib.bib3)); Defazio et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib25)).

![Refer to caption](https://arxiv.org/html/2606.05610v1/x1.png)

Figure 1: Scaling Laws for Optimal Hyperparameters in Continued Pre-training

While scaling laws provide such guidance for training from random initialization Kaplan et al. ([2020](https://arxiv.org/html/2606.05610#bib.bib1)); Hernandez et al. ([2021](https://arxiv.org/html/2606.05610#bib.bib48)), their application to CPT remains largely under-explored. Deriving scaling laws for CPT presents two core challenges: (1) it is unclear whether predictable relationships exist between model state and optimal hyperparameters when training does not start from scratch Gupta et al. ([2023](https://arxiv.org/html/2606.05610#bib.bib9)); Ibrahim et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib10)); and (2) distribution shifts between pre-training and target data complicate quantifying the effective contribution of the initial checkpoint to the training process Gadre and others ([2024](https://arxiv.org/html/2606.05610#bib.bib11)); Xie et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib12)). In this paper, we first empirically discover that optimal hyperparameters in CPT indeed follow stable and predictable scaling laws. Leveraging this insight, we address the hyperparameter selection challenge by establishing a framework for zero-shot prediction for a given checkpoint, quantifying the relationship between compute budget and optimal configurations.

Addressing Challenge 1: Discovering Hyperparameter Scaling Laws. To investigate the existence of these laws, we systematically train small-scale proxy models across varied scales and hyperparameter configurations on CPT data. Our experiments reveal consistent hyperparameter scaling laws (Fig. [1](https://arxiv.org/html/2606.05610#S1.F1)): optimal batch size increases monotonically with compute budget, while optimal learning rate decreases accordingly. This empirical discovery confirms that CPT maintains stable scaling relationships, distinct from scratch-training laws, that account for the prior knowledge encoded in initial checkpoints. Based on these observations, we derive mapping functions that relate compute budget directly to optimal $LR$ and $B$ via standard loss-compute scaling laws.

Addressing Challenge 2: Quantifying the Initial Training State. To apply these laws to arbitrary checkpoints, we introduce the concept of equivalent pre-training compute ($C_{\text{pre}}$). Unlike training from scratch, CPT starts from a non-zero optimization state that has historically been difficult to quantify. We propose that any checkpoint can be situated as a specific coordinate on a continuous training trajectory. By mapping a model's validation loss to a compute-equivalent value on the target domain, $C_{\text{pre}}$ formally quantifies its initial training state relative to the new task. This transforms the initial checkpoint from a "black box" into a measurable starting state, where the total effective compute is defined as $C_{\text{total}} = C_{\text{pre}} + C_{\text{cpt}}$ and $C_{\text{cpt}}$ is the planned CPT budget. By substituting $C_{\text{total}}$ into our derived hyperparameter scaling laws (Eqs. [3](https://arxiv.org/html/2606.05610#S2.E3) and [4](https://arxiv.org/html/2606.05610#S2.E4)), we achieve zero-shot prediction of the optimal learning rate and batch size for the target run.

In summary, our main contributions are threefold: (1) We establish scaling laws for optimal hyperparameters in continued pre-training, quantitatively relating compute budget to optimal learning rate and batch size, offering a principled alternative to heuristics and grid search. (2) We propose equivalent pre-training compute to formally quantify the training state inherited from initial checkpoints. This metric transforms an arbitrary starting point into a measurable coordinate, enabling reliable hyperparameter prediction for models at any stage of their lifecycle. (3) We empirically validate our framework on models up to Dense-8B and MoE-3B parameters, demonstrating that it reduces computational costs by 70-90% compared to traditional grid-search baselines while consistently improving training stability and performance.

![Refer to caption](https://arxiv.org/html/2606.05610v1/x2.png)

Figure 2: The Overall Framework for Optimal Hyperparameter Prediction in Continued Pre-training

## 2 Method

### 2.1 Problem Definition and Challenges

This paper investigates the existence of hyperparameter scaling laws that quantitatively relate the compute budget to the optimal hyperparameter configurations in continued pre-training (CPT). Establishing such laws would provide a principled framework for hyperparameter selection at any stage of a model's lifecycle, replacing computationally expensive heuristics. However, formulating these laws requires overcoming two fundamental challenges:

- Existence and Functional Characterization: It remains theoretically and empirically unverified whether predictable, monotonic relationships exist between the compute budget (or validation loss) and optimal hyperparameters when training does not originate from a random initialization. If such relationships exist, identifying their specific functional forms (e.g., power laws) and ensuring their stability across different model scales and architectures is a non-trivial task.
- Quantifying the Initial Optimization State: Unlike training from scratch, CPT begins from a non-zero optimization state inherited from a pre-trained checkpoint. Given that pre-training and CPT typically involve distinct data distributions, the initial checkpoint represents a "black box" in terms of its progress on the target domain. A core challenge lies in precisely situating this checkpoint as a measurable coordinate on a continuous training trajectory, that is, quantifying its inherited knowledge as an equivalent compute budget relative to the target data.

### 2.2 Overall Framework

To address the challenges of non\-zero initialization and distribution shift, we propose a two\-stage framework that models the evolution of optimal hyperparameters along a continuous optimization trajectory \(Fig\.[2](https://arxiv.org/html/2606.05610#S1.F2)\)\. This framework provides a unified approach to hyperparameter transfer, transforming the “black\-box” initial state of any checkpoint into a quantifiable computational coordinate\.

Stage 1: Empirical Law Discovery. We first characterize the scaling behavior by training small-scale proxy models of varying sizes $N$ under diverse hyperparameter configurations ($LR$, $B$) on the target CPT data. By analyzing the optimal hyperparameter configurations at different validation-loss levels, we establish two foundational components: (1) Hyperparameter-Loss Mappings: the functional relationships between optimal hyperparameters and the achieved validation loss, denoted as $B_{\text{opt}} = f(L)$ and $LR_{\text{opt}} = g(L)$; (2) Loss-Compute Scaling Laws: the standard power-law relationship $L(C)$ that describes how validation loss scales with compute budget $C$ on the target distribution.

Stage 2: State-Aware Hyperparameter Prediction. Given a pre-trained checkpoint $M_{\theta_0}$ with an initial validation loss $L_{\text{init}}$ on the target domain, we perform a zero-shot prediction of the optimal CPT configurations. First, we utilize the inverse loss-compute scaling law to estimate the equivalent pre-training compute $C_{\text{pre}} = L^{-1}(L_{\text{init}})$. This step effectively situates the checkpoint as a specific coordinate on the training trajectory. Second, we define the total effective compute as $C_{\text{total}} = C_{\text{pre}} + C_{\text{cpt}}$, where $C_{\text{cpt}}$ is the planned CPT budget. Finally, the optimal hyperparameters for the target run are predicted by evaluating $f$ and $g$ at the predicted terminal loss $L_{\text{target}} = L(C_{\text{total}})$, i.e., $f(L_{\text{target}})$ and $g(L_{\text{target}})$.

### 2.3 Empirical Law Discovery

This stage establishes the foundational mappings between validation loss $L$ and the optimal hyperparameter configurations: $B_{\text{opt}}(L)$ and $LR_{\text{opt}}(L)$. Since loss $L$ itself follows a predictable scaling law with compute budget $C$ (i.e., $L = \mathcal{S}(C)$), these relationships implicitly define the trajectory of optimal hyperparameters as compute scales: $B_{\text{opt}} = f(\mathcal{S}(C))$ and $LR_{\text{opt}} = g(\mathcal{S}(C))$.

#### Definition of Optimality\.

For a given model scale and dataset, we define the optimal configuration $(B_{\text{opt}}, LR_{\text{opt}})$ for a target loss $L$ as the one that reaches the target loss with the minimum training compute. Formally, among all configurations $\mathcal{H} = \{(B, LR)\}$ that achieve a validation loss $\leq L$, $(B_{\text{opt}}, LR_{\text{opt}})$ is optimal if it minimizes the total compute $C$ (measured in FLOPs), thereby identifying the most compute-efficient path to a specific optimization state.

#### Empirical Data Collection via Proxy Models\.

To discover these laws, we utilize small-scale proxy models ($N \in \{100\text{M}, 500\text{M}\}$) trained on the target CPT dataset. For each scale, we execute an extensive grid search over $B$ and $LR$, monitoring validation loss across a multi-domain benchmark (General, Math, Code). The resulting loss-versus-compute trajectories for each $(N, B, LR)$ triplet provide the empirical basis for our scaling analysis.

#### Three\-Step Derivation Process\.

We derive the mapping function through the following steps:

Step 1: Constructing Iso-Loss Curves. For a target loss level $\hat{L}$, we identify the required data tokens $\hat{D}$ for each hyperparameter configuration. By plotting these estimates in the $(B, LR)$ plane, we construct iso-loss curves as illustrated in Fig. [3](https://arxiv.org/html/2606.05610#S3.F3), where each point on a curve corresponds to the number of data tokens $\hat{D}$ required to reach the target loss $\hat{L}$ under a specific hyperparameter configuration. The vertex of each curve, representing the minimum $\hat{D}$ to reach $\hat{L}$, uniquely identifies the optimal pair $(B_{\text{opt}}(\hat{L}), LR_{\text{opt}}(\hat{L}))$.
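As a rough illustration of Step 1, the sketch below estimates $\hat{D}$ for each configuration and picks the one with the smallest value. It assumes each loss curve is monotonically decreasing and uses linear interpolation between logged points; both are simplifications introduced here, not details taken from the paper. The function names and the toy curves are hypothetical.

```python
import numpy as np

def tokens_to_reach(tokens, losses, target_loss):
    """Interpolate the token count at which a loss curve crosses target_loss.

    Assumes `losses` is monotonically decreasing in `tokens`.
    Returns inf if the target loss is never reached.
    """
    tokens = np.asarray(tokens, dtype=float)
    losses = np.asarray(losses, dtype=float)
    if losses.min() > target_loss:
        return float("inf")
    # np.interp needs increasing x-coordinates, so reverse the decreasing loss axis.
    return float(np.interp(target_loss, losses[::-1], tokens[::-1]))

def iso_loss_vertex(curves, target_loss):
    """`curves` maps (B, LR) -> (tokens, losses); returns (best_config, d_hat per config)."""
    d_hat = {cfg: tokens_to_reach(t, l, target_loss) for cfg, (t, l) in curves.items()}
    best_cfg = min(d_hat, key=d_hat.get)
    return best_cfg, d_hat

# Toy, made-up loss curves for three (B, LR) configurations.
tokens = [1e9, 2e9, 4e9]
curves = {
    (64, 1e-4): (tokens, [3.4, 3.1, 2.9]),
    (128, 2e-4): (tokens, [3.3, 3.0, 2.8]),
    (256, 5e-4): (tokens, [3.6, 3.2, 3.05]),  # never reaches the target
}
best_cfg, d_hat = iso_loss_vertex(curves, target_loss=3.0)
```

Repeating this for several target losses $\hat{L}_i$ would yield the pairs $(\hat{L}_i, B_{\text{opt},i})$ and $(\hat{L}_i, LR_{\text{opt},i})$ used in the fitting step.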

Step 2: Functional Fitting. We observe that the relationship between loss and optimal hyperparameters follows a power-law trend. We fit the collected pairs $\{(\hat{L}_i, B_{\text{opt},i})\}$ and $\{(\hat{L}_i, LR_{\text{opt},i})\}$ to the following parametric forms:

$$B_{\text{opt}}(L) = f(L) \quad \text{and} \quad LR_{\text{opt}}(L) = g(L). \tag{1}$$
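The text does not spell out the parametric forms of $f$ and $g$ here. One plausible reading of "power-law trend" is $y = a \cdot L^{b}$, which can be fitted by linear regression in log-log space. The sketch below assumes that form; the helper `fit_power_law` and the synthetic data are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    # polyfit returns coefficients from highest degree down: [slope, intercept].
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_a)), float(b)

# Synthetic (loss, optimal batch size) pairs generated from a known power law.
# These numbers are made up and are NOT results from the paper.
losses = np.array([3.2, 3.0, 2.8, 2.6])
b_opt = 4000.0 * losses ** -3.0

a, b = fit_power_law(losses, b_opt)

def f(L):
    """Fitted B_opt(L) under the assumed form a * L**b."""
    return a * L ** b
```

The same routine applies to $g$ by passing the $LR_{\text{opt},i}$ values instead. If the true relationship is a power law in $L - L_0$ rather than in $L$, the regression input would need to be shifted accordingly.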
Step 3: Composition with Compute Scaling Laws. Following Kaplan et al. ([2020](https://arxiv.org/html/2606.05610#bib.bib1)), we model the loss-compute relationship as:

$$L(C) = L_0 + \alpha \cdot C^{-\gamma}, \tag{2}$$

where $L_0$ is the irreducible loss, $\alpha$ is a fitted scale coefficient, and $\gamma$ is the scaling exponent. By substituting Eq. [2](https://arxiv.org/html/2606.05610#S2.E2) into our fitted functions, we obtain the final Compute-to-Hyperparameter Scaling Laws:

$$B_{\text{opt}}(C) = f(L_0 + \alpha \cdot C^{-\gamma}), \tag{3}$$

$$LR_{\text{opt}}(C) = g(L_0 + \alpha \cdot C^{-\gamma}). \tag{4}$$

These laws enable zero-shot prediction of optimal hyperparameters for any target compute budget $C_{\text{total}}$, circumventing the need for further tuning.
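To make Eq. 2 concrete, the following sketch fits $L_0$, $\alpha$, and $\gamma$ from (compute, loss) observations. The fitting procedure is an assumption made here for illustration (a scan over candidate $L_0$ values with a log-linear regression at each one); the paper does not describe how its parameters were estimated, and a nonlinear least-squares solver would be an equally reasonable choice. The data are synthetic.

```python
import numpy as np

def fit_loss_compute_law(compute, loss, n_grid=2001):
    """Fit L(C) = L0 + alpha * C**(-gamma).

    Scans candidate L0 values and, for each, regresses log(L - L0) on log(C).
    Returns (L0, alpha, gamma) for the candidate with the smallest residual.
    """
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    log_c = np.log(compute)
    best = None
    # L0 must lie strictly below every observed loss for the log to be defined.
    for l0 in np.linspace(0.0, loss.min() - 1e-6, n_grid):
        log_excess = np.log(loss - l0)
        slope, intercept = np.polyfit(log_c, log_excess, deg=1)
        residual = np.sum((slope * log_c + intercept - log_excess) ** 2)
        if best is None or residual < best[0]:
            best = (residual, l0, np.exp(intercept), -slope)
    _, l0, alpha, gamma = best
    return float(l0), float(alpha), float(gamma)

# Synthetic data from known parameters (L0=1.5, alpha=50, gamma=0.1).
# These values are made up and are NOT the paper's fitted coefficients.
compute = np.logspace(17, 21, 9)
loss = 1.5 + 50.0 * compute ** -0.1

l0_hat, alpha_hat, gamma_hat = fit_loss_compute_law(compute, loss)
```

With the three parameters in hand, Eqs. 3 and 4 reduce to evaluating the fitted $f$ and $g$ at `l0_hat + alpha_hat * C ** -gamma_hat`.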

### 2.4 State-Aware Hyperparameter Prediction

Given an initial pre-trained checkpoint $M_{\theta_0}$ and a target CPT compute budget $C_{\text{cpt}}$, this stage predicts the optimal hyperparameters by situating the checkpoint on the previously derived scaling trajectory. The core challenge lies in quantifying the optimization state of a pre-trained model when it is introduced to a new data distribution.

Our framework rests on the hypothesis that a pre-trained model's state on a target domain can be uniquely characterized by an equivalent pre-training compute ($C_{\text{pre}}$). We define $C_{\text{pre}}$ as the theoretical compute budget required to reach the model's current validation loss if it had been trained from a random initialization on the target data. This hypothesis allows us to treat CPT not as a disjoint training phase, but as a continuation of a single, contiguous optimization path. We empirically validate this hypothesis across diverse domain-shift scenarios in Section [3](https://arxiv.org/html/2606.05610#S3).

Step 1: Projecting the Initial State. We first evaluate the checkpoint's initial validation loss $L_{\text{init}}$ on the target CPT data. By inverting the loss-compute scaling law $L(C) = L_0 + \alpha C^{-\gamma}$ established in Section [2.3](https://arxiv.org/html/2606.05610#S2.SS3), we calculate the Equivalent Compute:

$$C_{\text{pre}} = \left(\frac{\alpha}{L_{\text{init}} - L_0}\right)^{1/\gamma}. \tag{5}$$

This step effectively projects the "black-box" checkpoint onto the target domain's compute-loss coordinate system.
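For completeness, the inversion behind Eq. 5 can be written out in three steps. The positivity conditions on the last line are implied by the form of the law rather than stated explicitly in the text.

```latex
\begin{align*}
L_{\text{init}} &= L_0 + \alpha \, C_{\text{pre}}^{-\gamma} \\
C_{\text{pre}}^{-\gamma} &= \frac{L_{\text{init}} - L_0}{\alpha} \\
C_{\text{pre}} &= \left(\frac{\alpha}{L_{\text{init}} - L_0}\right)^{1/\gamma},
\qquad L_{\text{init}} > L_0,\ \alpha > 0,\ \gamma > 0 .
\end{align*}
```

A checkpoint whose measured loss is at or below the fitted $L_0$ has no finite equivalent compute under this law, so the estimate is only meaningful when $L_{\text{init}} > L_0$.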

Step 2: Total Effective Computation. The total effective compute budget for the CPT phase is defined as the sum of the inherited and planned compute:

$$C_{\text{total}} = C_{\text{pre}} + C_{\text{cpt}}. \tag{6}$$

This formulation re-centers the CPT process onto the global scaling curve, where $C_{\text{total}}$ represents the terminal point of the intended training trajectory.

Step 3: Optimal Hyperparameter Prediction. We first predict the terminal validation loss $L_{\text{target}}$ at the end of the CPT run: $L_{\text{target}} = L(C_{\text{total}})$. The optimal hyperparameters are then predicted by evaluating the optimality functions at the predicted state:

$$B_{\text{opt}} = f\left(L_0 + \alpha \cdot C_{\text{total}}^{-\gamma}\right), \tag{7}$$

$$LR_{\text{opt}} = g\left(L_0 + \alpha \cdot C_{\text{total}}^{-\gamma}\right). \tag{8}$$

The complete procedure is summarized in Algorithm [1](https://arxiv.org/html/2606.05610#alg1) (Appendix [D](https://arxiv.org/html/2606.05610#A4)). This approach provides a principled, zero-shot mechanism for hyperparameter selection, directly addressing the non-zero initialization challenge via the $C_{\text{pre}}$ formulation.
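The three steps of this section can be sketched in a few lines of Python. This is not a reproduction of Algorithm 1, which is not shown in this excerpt; it is a minimal sketch of Eqs. 2 and 5 to 8. All coefficients and the forms of `f` and `g` below are hypothetical placeholders, chosen only so that batch size grows and learning rate shrinks as loss falls.

```python
def loss_from_compute(c, l0, alpha, gamma):
    """Eq. 2: L(C) = L0 + alpha * C**(-gamma)."""
    return l0 + alpha * c ** (-gamma)

def equivalent_pretraining_compute(l_init, l0, alpha, gamma):
    """Eq. 5: invert Eq. 2 at the checkpoint's initial validation loss."""
    if l_init <= l0:
        raise ValueError("L_init must exceed the irreducible loss L0")
    return (alpha / (l_init - l0)) ** (1.0 / gamma)

def predict_hyperparameters(l_init, c_cpt, l0, alpha, gamma, f, g):
    """Steps 1-3: project the checkpoint, add the CPT budget, evaluate f and g."""
    c_pre = equivalent_pretraining_compute(l_init, l0, alpha, gamma)  # Eq. 5
    c_total = c_pre + c_cpt                                           # Eq. 6
    l_target = loss_from_compute(c_total, l0, alpha, gamma)
    return {
        "c_pre": c_pre,
        "c_total": c_total,
        "l_target": l_target,
        "batch_size": f(l_target),      # Eq. 7
        "learning_rate": g(l_target),   # Eq. 8
    }

# Hypothetical coefficients and mappings, NOT the paper's fitted values.
L0, ALPHA, GAMMA = 1.5, 50.0, 0.1
f = lambda L: 4000.0 * L ** -3.0   # assumed B_opt(L)
g = lambda L: 1e-5 * L ** 3.0      # assumed LR_opt(L)

result = predict_hyperparameters(
    l_init=2.0, c_cpt=1e20, l0=L0, alpha=ALPHA, gamma=GAMMA, f=f, g=g
)
```

In practice the predicted batch size would still need rounding to a value the training setup supports; how the authors handle this is not described here.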

## 3 Experiments

We evaluate the proposed framework through a series of large-scale experiments designed to answer three key research questions:

- RQ1 (Predictability): Do optimal hyperparameters in CPT follow stable scaling relationships across different scales?
- RQ2 (Accuracy): Can the Equivalent Compute $C_{\text{pre}}$ accurately quantify the initial state of a given checkpoint under domain shift?
- RQ3 (Efficiency): Does our framework consistently outperform heuristic baselines while reducing the total search cost?

### 3.1 Datasets and Evaluation Protocol

To test the robustness of our framework across diverse data distributions, we curate a heterogeneous CPT corpus spanning three distinct domains: general knowledge, mathematics, and formal code\.

#### Training Corpus\.

As shown in Table [1](https://arxiv.org/html/2606.05610#S3.T1), we sample approximately 55B tokens from high-quality open-source repositories: SlimPajama Weber et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib36)) for general linguistic knowledge, OpenWebMath Paster et al. ([2023](https://arxiv.org/html/2606.05610#bib.bib37)) for mathematical reasoning, and GitHub for programming logic. To ensure unbiased evaluation, we construct a held-out validation set by randomly sampling 0.5% of documents from each domain.

Table 1: Composition of the 55B Token Continued Pre-training Corpus.

#### Evaluation Benchmarks.

To comprehensively evaluate model performance and the effectiveness of the predicted hyperparameters, we use a diverse suite of downstream benchmarks. (1) Knowledge: MMLU Hendrycks et al. ([2020](https://arxiv.org/html/2606.05610#bib.bib38)) and MMLU-Pro Wang et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib39)) for general world knowledge. (2) Mathematical Reasoning: GSM8K Cobbe et al. ([2021](https://arxiv.org/html/2606.05610#bib.bib40)) and MathQA Amini et al. ([2019](https://arxiv.org/html/2606.05610#bib.bib41)) for multi-step arithmetic and mathematical problem-solving. (3) Commonsense Reasoning: HellaSwag Zellers et al. ([2019](https://arxiv.org/html/2606.05610#bib.bib42)), PIQA Bisk et al. ([2020](https://arxiv.org/html/2606.05610#bib.bib43)), WinoGrande Sakaguchi et al. ([2021](https://arxiv.org/html/2606.05610#bib.bib44)), and OpenBookQA Mihaylov et al. ([2018](https://arxiv.org/html/2606.05610#bib.bib49)) for physical and logical commonsense. (4) Code Generation: HumanEval Chen et al. ([2021](https://arxiv.org/html/2606.05610#bib.bib46)) and LiveCodeBench Jain et al. ([2024](https://arxiv.org/html/2606.05610#bib.bib50)) to evaluate functional accuracy in code generation.

![Refer to caption](https://arxiv.org/html/2606.05610v1/x3.png)

Figure 3: Optimization analysis of learning rate and batch size

### 3.2 Settings for Empirical Law Discovery

In this phase, we investigate the relationship between compute budget and optimal hyperparameters, specifically learning rate ($LR$) and batch size ($B$), using the CPT data described in Section [3.1](https://arxiv.org/html/2606.05610#S3.SS1). We evaluate both dense and Mixture-of-Experts (MoE) architectures; detailed configurations are provided in Appendix [A](https://arxiv.org/html/2606.05610#A1).

For both architectures, models are trained at two distinct scales: $N \in \{100\text{M}, 500\text{M}\}$, where $N$ denotes the number of activated parameters during forward propagation. To identify optimal hyperparameter configurations at these scales, we conduct an exhaustive grid search for each of the four model variants: Dense-100M, Dense-500M, MoE-100M, and MoE-500M. The search space is defined by the following grid: $B \in \{64, 128, 256\}$ and $LR \in \{1\times10^{-4}, 2\times10^{-4}, 5\times10^{-4}\}$. All experiments employ the AdamW optimizer Loshchilov and Hutter ([2017](https://arxiv.org/html/2606.05610#bib.bib47)) with its default hyperparameters.
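The size of this search space can be enumerated directly. The snippet below simply takes the Cartesian product of the stated grid with the four proxy variants; the dictionary keys are illustrative names, not identifiers from the authors' code.

```python
from itertools import product

model_variants = ["Dense-100M", "Dense-500M", "MoE-100M", "MoE-500M"]
batch_sizes = [64, 128, 256]
learning_rates = [1e-4, 2e-4, 5e-4]

# 3 x 3 = 9 (B, LR) configurations per model variant.
grid = list(product(batch_sizes, learning_rates))

# One proxy training run per (variant, B, LR) combination.
runs = [
    {"model": m, "batch_size": b, "learning_rate": lr}
    for m, (b, lr) in product(model_variants, grid)
]
```

This gives 9 configurations per variant and 36 proxy runs in total, all at the 100M or 500M scale.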

As shown in Fig. [3](https://arxiv.org/html/2606.05610#S3.F3), the optimal $LR$ tends to decrease as the compute budget increases, whereas the optimal $B$ exhibits a corresponding increasing trend. These relationships are captured quantitatively via the scaling laws derived earlier and are illustrated in Fig. [1](https://arxiv.org/html/2606.05610#S1.F1). Furthermore, Appendix [B](https://arxiv.org/html/2606.05610#A2) details the empirical evolution of these hyperparameters, demonstrating why larger $B$ and smaller $LR$ are necessary near convergence.

### 3.3 Validation Settings for Continued Pre-training

To evaluate the predictive accuracy of our framework, we employ two large\-scale pre\-trained checkpoints as starting points: a Dense\-8B model \(pre\-trained on 6T tokens\) and an MoE\-3B model \(pre\-trained on 5T tokens\)\.

As detailed in Table [2](https://arxiv.org/html/2606.05610#S3.T2), we first estimate the equivalent pre-training compute ($C_{\text{pre}}$) for each checkpoint by evaluating its initial validation loss on the target CPT corpus. This value is then added to the planned compute budget for the 55B-token CPT phase ($C_{\text{cpt}}$). The resulting total effective compute budget, $C_{\text{total}} = C_{\text{pre}} + C_{\text{cpt}}$, is subsequently used to predict the optimal hyperparameters for each CPT run, leveraging the scaling laws established during the empirical law discovery phase (Section [2.3](https://arxiv.org/html/2606.05610#S2.SS3)). Detailed quantitative results and the specific parameters of these scaling laws are provided in Appendix [C](https://arxiv.org/html/2606.05610#A3).

Table 2: Experimental setup and predicted optimal ranges for continued pre-training.

Table 3: Ablation results on MoE-3B and Dense-8B comparing our predicted setting against variations in $LR$ and $B$. Best results are in bold, and second-best are underlined. We exclude the LiveCodeBench results for Dense-8B, as the benchmark's high difficulty leads to marginal performance and limited discriminative power.

Table 4: Comparison of computational costs for hyperparameter search.

### 3.4 Results Analysis

Applying our prediction protocol, the derived mapping functions yield the following optimal configurations: $\{LR = 5\times10^{-5}, B = 768\}$ for Dense-8B and $\{LR = 1\times10^{-4}, B = 1600\}$ for MoE-3B. To rigorously validate the optimality of these predictions, we systematically decouple the two hyperparameters to conduct controlled ablations. As shown in Table [3](https://arxiv.org/html/2606.05610#S3.T3), models trained with our predicted configurations consistently achieve peak average scores (51.3 and 57.9, respectively) compared to all off-target settings, empirically validating the framework's effectiveness across diverse architectures.

#### Impact of Learning Rate at Fixed Batch Size\.

With the batch size fixed at the predicted optimum, we observe that the learning rate is critical for balancing knowledge adaptation and stability. An excessively high $LR$ ($5\times10^{-4}$) leads to a sharp performance decline to 47.6 for Dense-8B and 50.8 for MoE-3B. This collapse, particularly pronounced in reasoning tasks such as GSM8K, suggests that excessive step sizes may disrupt pre-trained weight structures and trigger catastrophic forgetting. Conversely, smaller $LR$ values (e.g., $3\times10^{-5}$ for Dense-8B) yield suboptimal results (51.0), as they constrain the efficiency of new knowledge acquisition within the fixed token budget. Our predicted $LR$ balances retention of prior capabilities against acquisition of domain knowledge.

#### Impact of Batch Size at Fixed Learning Rate\.

Varying $B$ while fixing the optimal $LR$ reveals a distinct inverted U-shaped performance curve. Smaller batch sizes (256 for Dense-8B, 512 for MoE-3B) suffer from high gradient variance and unstable optimization, resulting in lower scores (51.0 and 56.4). However, increasing $B$ beyond the predicted optimum is not always beneficial. Given a fixed compute budget, an excessively large $B$ (1600 for Dense-8B, 3200 for MoE-3B) significantly reduces the total number of gradient updates, hindering the model's convergence on the target distribution. This results in suboptimal scores of 51.1 and 57.7. Our proposed $B$ balances gradient fidelity and update frequency to maximize model performance.

### 3.5 Computational Efficiency Analysis

Compared to traditional grid searches conducted directly on large-scale target models, our framework significantly enhances computational efficiency by leveraging small-scale proxy models. As shown in Table [4](https://arxiv.org/html/2606.05610#S3.T4), this approach reduces the total compute budget by 92.0% for Dense-8B and 73.1% for MoE-3B. Specifically, a traditional grid search evaluating 9 configurations on the target models would require $1.02\times10^{22}$ FLOPs for Dense-8B and $3.03\times10^{21}$ FLOPs for MoE-3B. In contrast, our framework only performs these evaluations on 100M and 500M proxy models. Due to the substantially lower compute requirements of these proxies, our total search budget is merely $8.15\times10^{20}$ FLOPs. Because the search cost is largely decoupled from the target model's scale, this efficiency advantage becomes increasingly pronounced as the size of the target LLM grows.
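The reported reductions follow directly from the three FLOP figures quoted above, as this short check shows. Only the numbers stated in the text are used; the variable and function names are illustrative.

```python
# FLOP figures as quoted in the text.
grid_search_flops = {"Dense-8B": 1.02e22, "MoE-3B": 3.03e21}
proxy_search_flops = 8.15e20

def reduction_percent(baseline, ours):
    """Relative compute saved versus the baseline, in percent."""
    return 100.0 * (1.0 - ours / baseline)

savings = {
    name: reduction_percent(flops, proxy_search_flops)
    for name, flops in grid_search_flops.items()
}
# Rounded to one decimal place: Dense-8B -> 92.0, MoE-3B -> 73.1
```

Since the proxy budget is the same for both targets, the smaller saving for MoE-3B comes entirely from its cheaper grid-search baseline.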

### 3.6 Ablation Study

To validate the necessity of the Equivalent Compute ($C_{\text{pre}}$) mechanism, we conduct ablation studies on Dense-8B and MoE-3B models, comparing our method ($C_{\text{total}} = C_{\text{pre}} + C_{\text{cpt}}$) against two variants:

- Variant A: Ignoring the Pre-training State ($C_{\text{total}} = C_{\text{cpt}}$). This variant treats CPT as an independent from-scratch training process, ignoring the model's initial state and prior compute.
- Variant B: Using Raw Pre-training Tokens. This variant acknowledges pre-training history but ignores distribution shifts. $C_{\text{pre}}$ is set to the theoretical compute of the original pre-training tokens, typically resulting in a significantly inflated $C_{\text{total}}$.

Table 5: Ablation study on scaling strategies. Scores represent normalized category-averaged performance.

As shown in Table [5](https://arxiv.org/html/2606.05610#S3.T5), the study reveals two key findings:

(1) Necessity of Accounting for the Initial State. Ignoring $C_{\text{pre}}$ (Variant A) misinterprets CPT as an early-stage training process, leading to an aggressively high learning rate and a smaller batch size. Consequently, the average scores for Dense-8B and MoE-3B drop by 0.9 and 1.7 points, respectively. The resulting training instability and initial loss divergence confirm that the pre-trained state must be explicitly accounted for to prevent disrupting the pre-trained weight structure.

(2) Superiority of $C_{\text{pre}}$ over Raw Token Counts. Using raw token counts (Variant B) inflates $C_{\text{total}}$, causing the framework to predict an overly conservative $LR$ and a large $B$ by incorrectly assuming the model is near convergence. Our proposed method outperforms Variant B on both Dense-8B (51.3 vs. 51.1) and MoE-3B (56.9 vs. 56.8). This demonstrates that $C_{\text{pre}}$ effectively calibrates the model's knowledge on the target domain, bridging the gap introduced by distribution shifts without requiring costly heuristic tuning.

These results confirm that neither the CPT budget alone nor the original pre-training scale is sufficient for reliable hyperparameter prediction. Instead, $C_{\text{pre}}$ provides a robust metric for determining the optimal optimization trajectory during domain adaptation.
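The direction of both failure modes can be reproduced with a toy calculation. The sketch below plugs three choices of $C_{\text{total}}$ into Eq. 2 and then into hypothetical monotone mappings for batch size and learning rate. Every coefficient and compute value is made up for illustration; only the ordering of the predictions is meant to be informative.

```python
def hyperparams_from_total_compute(c_total, l0=1.5, alpha=50.0, gamma=0.1):
    """Predict (batch_size, learning_rate) from C_total under assumed laws."""
    l_target = l0 + alpha * c_total ** (-gamma)   # Eq. 2 with made-up coefficients
    batch_size = 4000.0 * l_target ** -3.0        # hypothetical f(L)
    learning_rate = 1e-5 * l_target ** 3.0        # hypothetical g(L)
    return batch_size, learning_rate

# Made-up compute values, NOT taken from the paper.
c_cpt = 1e20         # planned CPT compute
c_pre_equiv = 1e20   # equivalent pre-training compute (Eq. 5)
c_pre_raw = 1e22     # compute of the original pre-training run

variants = {
    "Variant A (C_cpt only)": c_cpt,
    "Ours (C_pre + C_cpt)": c_pre_equiv + c_cpt,
    "Variant B (raw pre-training compute)": c_pre_raw + c_cpt,
}
predictions = {
    name: hyperparams_from_total_compute(c) for name, c in variants.items()
}
```

Under these assumptions Variant A yields the highest learning rate and smallest batch size, and Variant B the lowest learning rate and largest batch size, matching the qualitative behavior described above.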

## 4 Related Work

#### Scaling Laws

Scaling laws characterize quantitative relationships between model performance and computational resources\. Early work established power\-law dependencies of validation loss on model size, dataset size, and compute budget Kaplan et al\. \([2020](https://arxiv.org/html/2606.05610#bib.bib1)\); Hoffmann et al\. \([2022](https://arxiv.org/html/2606.05610#bib.bib2)\)\. Recent studies have extended such laws to hyperparameters, showing that the optimal learning rate follows power laws with respect to model size and data volume, while the optimal batch size is governed by data scale Li et al\. \([2025](https://arxiv.org/html/2606.05610#bib.bib18)\); Zhang et al\. \([2025](https://arxiv.org/html/2606.05610#bib.bib19)\); Shuai et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib20)\)\. The critical batch size \(CBS\) is known to scale with training data volume McCandlish et al\. \([2018](https://arxiv.org/html/2606.05610#bib.bib28)\); Zhang et al\. \([2025](https://arxiv.org/html/2606.05610#bib.bib19)\)\. However, existing hyperparameter scaling laws focus almost exclusively on training from scratch, leaving continued pre\-training underexplored\.

#### Continued Pre\-training

Continued pre\-training adapts foundation models to target domains while balancing new knowledge acquisition and original capability retention Gupta et al\. \([2023](https://arxiv.org/html/2606.05610#bib.bib9)\); Luo et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib35)\)\. Practical techniques such as learning rate rewarming, re\-decay, and data replay can help approach the performance of training from scratch Ibrahim et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib10)\); Parmar et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib14)\)\. Warmup\-Stable\-Decay \(WSD\) schedules stabilize training and mitigate oscillations in flat loss landscapes Hu et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib17)\); Wen et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib21)\); Hägele et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib6)\)\. Despite these advances, hyperparameter selection \(e\.g\., learning rate, batch size\) still relies heavily on heuristic tuning, and a principled, predictive theory for hyperparameter scaling in continued pre\-training remains missing\.

#### Hyperparameter Optimization

The Maximal Update Parametrization \(μP\) enables zero\-shot hyperparameter transfer across model scales Yang et al\. \([2023](https://arxiv.org/html/2606.05610#bib.bib4)\); Dey et al\. \([2023](https://arxiv.org/html/2606.05610#bib.bib30)\)\. For batch size, gradient\-noise\-scale\-based CBS estimation relies on strong assumptions, while direct measurement through warmup steps offers greater practicality McCandlish et al\. \([2018](https://arxiv.org/html/2606.05610#bib.bib28)\); Bergsma et al\. \([2025](https://arxiv.org/html/2606.05610#bib.bib23)\)\. Recent work also identifies power\-law relations among learning rate, batch size, and computation Shen et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib24)\); Everett et al\. \([2024](https://arxiv.org/html/2606.05610#bib.bib29)\)\. Despite this progress, two critical gaps remain: existing methods target training from scratch rather than continued pre\-training, and μP enables scale transfer but not dynamic hyperparameter adjustment from a given checkpoint\. To bridge these gaps, we introduce Equivalent Compute to quantify the effective computational contribution of prior training, allowing us to predict optimal hyperparameters for continued pre\-training starting from any given checkpoint\.

## 5 Conclusion

In this work, we have presented a novel framework for establishing hyperparameter scaling laws for LLM continued pre\-training\. By systematically analyzing the optimization trajectories of small\-scale proxy models, we identified predictable relationships between the total compute budget and the optimal learning rate and batch size\. To address the challenge of non\-zero initialization, we introduced the concept of Equivalent Compute, which quantifies a pre\-trained model’s optimization state relative to a target domain\. This formulation enables the direct prediction of optimal hyperparameters for a given checkpoint and compute budget without requiring any trial runs on the target models\. Extensive experiments on models up to 8B parameters demonstrate that our framework significantly improves training stability and final performance while reducing computational search costs by up to 92%\. Overall, our findings offer a principled and scalable methodology for efficient domain adaptation, eliminating the need for exhaustive and costly grid searches\.

## Limitations

Despite the effectiveness of our framework, several limitations remain for future investigation:

- Verification on Ultra\-Large Scales: Due to the immense computational resources required, our empirical validation was primarily conducted on models up to Dense\-8B and MoE\-3B\. While the observed scaling trends are robust at these scales, further research is needed to verify whether specific power\-law constants remain invariant or require re\-calibration for ultra\-large\-scale models \(e\.g\., 70B\+ parameters\), where emergent properties or training instabilities might occur\.
- Domain\-Specific Scaling Sensitivity: Our framework utilizes a unified scaling law fitted on a representative data mixture\. However, different data distributions \(e\.g\., formal code vs\. mathematical reasoning\) may exhibit slightly different scaling exponents \($\gamma$\) and irreducible loss floors \($L_0$\)\. While our mixture\-based approach demonstrates strong generalizability across the evaluated domains, applying the framework to an extremely specialized or narrow domain might necessitate a localized calibration of the scaling parameters to maintain high prediction accuracy\.

## References

- A\. Amini, S\. Gabriel, S\. Lin, R\. Koncel\-Kedziorski, Y\. Choi, and H\. Hajishirzi \(2019\) Mathqa: towards interpretable math word problem solving with operation\-based formalisms\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), pp\. 2357–2367\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- Z\. Azerbayev, H\. Schoelkopf, K\. Paster, M\. Dos Santos, S\. McAleer, A\. Q\. Jiang, J\. Deng, and S\. Biderman \(2024\) Llemma: an open language model for mathematics\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2)\.
- S\. Bergsma et al\. \(2025\) Critical batch size revisited: a simple empirical approach to large\-batch language model training\. arXiv preprint arXiv:2505\.23971\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px3.p1.2)\.
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi, et al\. \(2020\) Piqa: reasoning about physical commonsense in natural language\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 34, pp\. 7432–7439\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- A\. Defazio, X\. Yang, K\. Mishchenko, A\. Khaled, H\. Mehta, and A\. Cutkosky \(2024\) The road less scheduled\. arXiv preprint arXiv:2405\.15682\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2)\.
- N\. Dey, G\. Gosal, Z\. Chen, H\. Khachane, W\. Marshall, R\. Pathria, M\. Tom, and J\. Hestness \(2023\) Cerebras\-gpt: open compute\-optimal language models trained on the cerebras wafer\-scale cluster\. arXiv preprint arXiv:2304\.03208\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px3.p1.2)\.
- K\. Everett, L\. Xiao, M\. Wortsman, A\. A\. Alemi, R\. Novak, P\. J\. Liu, I\. Gur, and J\. Pennington \(2024\) Scaling exponents across parameterizations and optimizers\. In International Conference on Machine Learning \(ICML\)\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px3.p1.2)\.
- S\. Y\. Gadre et al\. \(2024\) Beyond chinchilla\-optimal: scaling laws for continued training\. arXiv preprint arXiv:2404\.01230\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p2.1)\.
- K\. Gupta, B\. Thérien, A\. Ibrahim, et al\. \(2023\) Continual pre\-training of large language models: how to \(re\) warm your model?\. arXiv preprint arXiv:2308\.04014\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2), [§1](https://arxiv.org/html/2606.05610#S1.p2.1), [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Hägele, E\. Bakouch, A\. Kosson, L\. Ben Allal, L\. Von Werra, and M\. Jaggi \(2024\) Scaling laws and compute\-optimal training beyond fixed training durations\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\) Measuring massive multitask language understanding\. arXiv preprint arXiv:2009\.03300\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- D\. Hernandez, J\. Kaplan, T\. Henighan, and S\. McCandlish \(2021\) Scaling laws for transfer\. arXiv preprint arXiv:2102\.01293\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p2.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, et al\. \(2022\) Training compute\-optimal large language models\. arXiv preprint arXiv:2203\.15556\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px1.p1.1)\.
- S\. Hu, Y\. Tu, X\. Han, C\. He, G\. Cui, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, W\. Zhao, et al\. \(2024\) MiniCPM: unveiling the potential of small language models with scalable training strategies\. arXiv preprint arXiv:2404\.06395\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Ibrahim, B\. Thérien, K\. Gupta, M\. Lukoševičius, and I\. Rish \(2024\) Simple and scalable strategies to continually pre\-train large language models\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p2.1), [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- N\. Jain, K\. Han, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2024\) Livecodebench: holistic and contamination free evaluation of large language models for code\. arXiv preprint arXiv:2403\.07974\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\) Scaling laws for neural language models\. arXiv preprint arXiv:2001\.08361\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p2.1), [§2\.3](https://arxiv.org/html/2606.05610#S2.SS3.SSS0.Px3.p4.3), [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px1.p1.1)\.
- Z\. Ke, Y\. Shao, H\. Lin, T\. Konuk, and H\. Lovett \(2023\) Continual pre\-training of language models\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2)\.
- H\. Li, W\. Zheng, J\. Hu, Q\. Wang, H\. Zhang, Z\. Wang, S\. Xuyang, Y\. Fan, S\. Zhou, X\. Zhang, and D\. Jiang \(2025\) Predictable scale: part i, step law – optimal hyperparameter scaling law in large language model pretraining\. arXiv preprint arXiv:2503\.04715\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px1.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2017\) Decoupled weight decay regularization\. arXiv preprint arXiv:1711\.05101\. Cited by: [§3\.2](https://arxiv.org/html/2606.05610#S3.SS2.p2.4)\.
- Y\. Luo, Z\. Yang, F\. Meng, Y\. Li, J\. Zhou, and Y\. Zhang \(2024\) An empirical study of catastrophic forgetting in large language models during continual fine\-tuning\. arXiv preprint arXiv:2308\.08747\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2), [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- S\. McCandlish, J\. Kaplan, D\. Amodei, and O\. D\. Team \(2018\) An empirical model of large\-batch training\. arXiv preprint arXiv:1812\.06162\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px3.p1.2)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\) Can a suit of armor conduct electricity? a new dataset for open book question answering\. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp\. 2381–2391\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- J\. Parmar et al\. \(2024\) Reuse, don’t retrain: a recipe for continued pretraining of language models\. arXiv preprint arXiv:2407\.07263\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- K\. Paster, M\. D\. Santos, Z\. Azerbayev, and J\. Ba \(2023\) Openwebmath: an open dataset of high\-quality mathematical web text\. arXiv preprint arXiv:2310\.06786\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px1.p1.1)\.
- B\. Rozière, J\. Gehring, F\. Gloeckle, S\. Sootla, I\. Gat, X\. E\. Tan, Y\. Adi, J\. Liu, T\. Remez, J\. Rapin, et al\. \(2023\) Code llama: open foundation models for code\. arXiv preprint arXiv:2308\.12950\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\) Winogrande: an adversarial winograd schema challenge at scale\. Communications of the ACM 64\(9\), pp\. 99–106\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- Y\. Shen, M\. Stallone, M\. Mishra, G\. Zhang, S\. Tan, A\. Prasad, A\. M\. Soria, D\. D\. Cox, and R\. Panda \(2024\) Power scheduler: a batch size and token number agnostic learning rate scheduler\. arXiv preprint arXiv:2408\.13359\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px3.p1.2)\.
- X\. Shuai et al\. \(2024\) Scaling law for language models training considering batch size\. arXiv preprint arXiv:2412\.01505\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px1.p1.1)\.
- Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, et al\. \(2024\) Mmlu\-pro: a more robust and challenging multi\-task language understanding benchmark\. Advances in Neural Information Processing Systems 37, pp\. 95266–95290\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- M\. Weber, D\. Fu, Q\. Anthony, Y\. Oren, S\. Adams, A\. Alexandrov, X\. Lyu, H\. Nguyen, X\. Yao, V\. Adams, et al\. \(2024\) Redpajama: an open dataset for training large language models\. Advances in neural information processing systems 37, pp\. 116462–116492\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px1.p1.1)\.
- K\. Wen, Z\. Li, J\. Wang, D\. Hall, P\. Liang, and T\. Ma \(2024\) Understanding warmup\-stable\-decay learning rates: a river valley loss landscape perspective\. arXiv preprint arXiv:2410\.05192\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2), [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px2.p1.1)\.
- S\. M\. Xie, H\. Pham, X\. Dong, N\. Du, H\. Liu, Y\. Lu, P\. Liang, Q\. V\. Le, T\. Ma, and A\. W\. Yu \(2024\) DoReMi: optimizing data mixtures speeds up language model pretraining\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p2.1)\.
- G\. Yang, E\. J\. Hu, I\. Babuschkin, S\. Sidor, X\. Liu, D\. Farhi, N\. Ryder, J\. Pachocki, W\. Chen, and J\. Gao \(2022\) Tensor programs v: tuning large neural networks via zero\-shot hyperparameter transfer\. In Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§1](https://arxiv.org/html/2606.05610#S1.p1.2)\.
- G\. Yang, D\. Yu, C\. Zhu, and S\. Hayou \(2023\) Tensor programs vi: feature learning in infinite\-depth neural networks\. arXiv preprint arXiv:2310\.02244\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px3.p1.2)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) Hellaswag: can a machine really finish your sentence?\. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp\. 4791–4800\. Cited by: [§3\.1](https://arxiv.org/html/2606.05610#S3.SS1.SSS0.Px2.p1.1)\.
- H\. Zhang, D\. Morwani, N\. Vyas, J\. Wu, D\. Zou, U\. Ghai, D\. Foster, and S\. Kakade \(2025\) How does critical batch size scale in pre\-training?\. In International Conference on Learning Representations \(ICLR\)\. Cited by: [§4](https://arxiv.org/html/2606.05610#S4.SS0.SSS0.Px1.p1.1)\.

![Refer to caption](https://arxiv.org/html/2606.05610v1/x4.png) \(a\) Impact of Batch Size \($B$\)
![Refer to caption](https://arxiv.org/html/2606.05610v1/x5.png) \(b\) Impact of Learning Rate \($LR$\)

Figure 4: Validation loss trajectories during continued pre\-training\. \(a\) Larger batch sizes \(e\.g\., $B = 256$\) enhance gradient stability and yield lower terminal loss, albeit with slower initial reduction\. \(b\) A lower learning rate \(e\.g\., $LR = 1 \times 10^{-4}$\) effectively mitigates late\-stage loss oscillations, ensuring superior convergence compared to more aggressive schedules\.

## Appendix A Model Architecture Details

This appendix provides the architectural specifications of the models evaluated in our study\. Table[6](https://arxiv.org/html/2606.05610#A0.T6)summarizes the configurations for both the Dense\-8B and MoE\-3B models\. The MoE\-3B model specifically incorporates Multi\-Head Latent Attention \(MLA\) and a sparse Mixture\-of\-Experts \(MoE\) layer to optimize compute efficiency\.

Table 6: Detailed architectural specifications for Dense\-8B and MoE\-3B models\.

## Appendix B Empirical Trends of Hyperparameters During Training

In this section, we examine the empirical evolution of key hyperparameters throughout the training process\. Our observations suggest a systematic trend: as training progresses and the model moves toward convergence, achieving optimal performance necessitates a progressively larger batch size and a reduced learning rate\.

#### Trends in Batch Size\.

Figure [4](https://arxiv.org/html/2606.05610#A0.F4)\(a\) illustrates the training loss trajectories across different batch sizes \($B \in \{64, 128, 256\}$\)\. Initially, the largest batch size \($B = 256$\) exhibits a slower rate of loss reduction per token compared to smaller configurations\. However, as training progresses, a distinct crossover occurs: the $B = 256$ configuration eventually achieves a lower loss, ultimately converging to the most favorable terminal value\. This suggests that while smaller batches provide a stronger stochastic signal early on, larger batches are essential for fine\-grained optimization in the later stages\.

#### Trends in Learning Rate\.

A parallel trend is observed for the learning rate, as shown in Figure [4](https://arxiv.org/html/2606.05610#A0.F4)\(b\)\. Although the smallest learning rate \($LR = 1 \times 10^{-4}$\) initially lags behind more aggressive schedules in terms of reduction speed, it shows superior long\-term stability\. Eventually, the $LR = 1 \times 10^{-4}$ configuration outperforms its counterparts, attaining a lower final loss\. This reinforces the necessity of decaying the learning rate to navigate the flatter regions of the loss landscape as training matures\.

## Appendix C Scaling Laws for Optimal Hyperparameters

Table 7: Linear regression parameters for optimal hyperparameters \($y = ax + b$\)\. Here, $x$ denotes the hyperparameter \($\ln(LR)$ or $B$\), and $y$ represents $\ln(\text{FLOPs})$\.

To quantitatively analyze the evolution of optimal hyperparameters relative to the computational scale, we perform a linear regression of the optimal hyperparameter values \($x$\) against the logarithm of the total compute budget \($y = \ln(\text{FLOPs})$\)\. We model this relationship as $y = ax + b$, where $a$ represents the sensitivity of the compute budget to hyperparameter changes\. The resulting scaling parameters and goodness\-of\-fit \($R^2$\) for both Dense and MoE architectures are summarized in Table [7](https://arxiv.org/html/2606.05610#A3.T7)\.

#### Analysis of Learning Rate Scaling\.

The regression analysis reveals divergent scaling trajectories between Dense and MoE architectures, reflecting fundamental differences in their optimization dynamics\. For the learning rate, the scaling coefficient $dx/dy$ \(calculated as $1/a$\) for MoE \($\approx -0.496$\) is significantly less negative than that of Dense models \($\approx -1.338$\)\. This indicates that MoE architectures exhibit reduced sensitivity to computational expansion\. We hypothesize that this stems from the sparse activation pattern: while the total parameter count scales, the number of active parameters per token grows more conservatively\. Consequently, MoE models maintain lower optimization inertia, allowing them to sustain larger, more effective step sizes without the stability risks typically associated with scaling dense networks\.
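Since $x = \ln(LR)$ and $y = \ln(\text{FLOPs})$, the linear fit can be rewritten as a power law in compute, which makes the reported coefficients easier to read\. The rearrangement below follows directly from $y = ax + b$; the subscript "opt" on $LR$ is added here for clarity\.

```latex
\begin{align}
\ln(\text{FLOPs}) &= a \,\ln(LR_{\text{opt}}) + b \\
\ln(LR_{\text{opt}}) &= \frac{\ln(\text{FLOPs}) - b}{a} \\
LR_{\text{opt}} &= e^{-b/a} \cdot \text{FLOPs}^{\,1/a}
\end{align}
```

With the reported values of $1/a$, this gives $LR_{\text{opt}} \propto \text{FLOPs}^{-0.496}$ for MoE and $LR_{\text{opt}} \propto \text{FLOPs}^{-1.338}$ for Dense\. Doubling the compute budget therefore scales the predicted optimal learning rate by roughly $2^{-0.496} \approx 0.71$ for MoE and $2^{-1.338} \approx 0.40$ for Dense\.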

#### Analysis of Batch Size Scaling\.

Regarding the batch size, the growth rate $dx/dy$ for MoE \($\approx 238.1$\) is more than double that of Dense models \($\approx 107.5$\)\. This reveals a disproportionate reliance on batch parallelism in sparse architectures\. From a statistical perspective, MoE models require a much larger global batch to ensure sufficient token coverage for each expert and to stabilize the routing mechanism\. As the model scales, an aggressive expansion of batch size is mandatory to mitigate the stochastic variance in expert selection and to maintain gradient fidelity across sparse layers\. This implies that for future exascale MoE models, the primary scaling bottleneck may shift from raw compute to the efficient utilization of massive batch\-level parallelism\.
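For batch size the regressor is $B$ itself rather than $\ln(B)$, so the fit implies that the optimal batch size grows additively with $\ln(\text{FLOPs})$\. The short sketch below turns the two reported growth rates into the implied batch\-size increase when compute is multiplied by a given factor\. The helper name is hypothetical, and the unit of $B$ is whatever Table 7 uses, which is not restated in this appendix\.

```python
import math


def batch_size_increase(dx_dy, compute_ratio):
    """Additive change in optimal B when compute is multiplied by compute_ratio.

    Follows from B = (ln(FLOPs) - b) / a, so delta_B = (1/a) * ln(ratio).
    """
    if compute_ratio <= 0:
        raise ValueError("compute_ratio must be positive")
    return dx_dy * math.log(compute_ratio)


# Growth rates dx/dy as reported in the text above.
REPORTED_SLOPES = {"Dense": 107.5, "MoE": 238.1}

for name, slope in REPORTED_SLOPES.items():
    delta = batch_size_increase(slope, 2.0)
    print(f"{name}: doubling compute adds about {delta:.1f} to the optimal B")
```

Under this reading, each doubling of compute adds a fixed amount to the optimal batch size, and that amount is a little over twice as large for MoE as for Dense\.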

## Appendix D Modeling and Prediction Algorithm

This appendix provides the formal procedural description of our proposed framework, covering both the cold\-start modeling of scaling laws and the zero\-shot hyperparameter prediction for a given checkpoint\.

Algorithm 1 Hyperparameter Scaling Law Framework for Continued Pre\-training

Require: Small\-scale proxy models $\mathcal{N}$, target dataset $\mathcal{D}_{\text{cpt}}$, initial pretrained checkpoint $M_{\theta_0}$, planned compute budget $C_{\text{cpt}}$

Ensure: Predicted optimal hyperparameters $B_{\text{opt}}$ and $LR_{\text{opt}}$

1. Stage 1: Cold\-Start Modeling \(Offline\)
2. for each model size $N \in \mathcal{N}$ do
3. Perform grid search over $(B, LR)$ on $\mathcal{D}_{\text{cpt}}$\.
4. Construct the optimal scaling law of \(Loss $L$, Compute $C$\) trajectories\.
5. Identify optimal configurations $(B_{\text{opt}}, LR_{\text{opt}})$ that minimize $C$ for a given $L$\.
6. end for
7. Fit mapping functions: $B_{\text{opt}} = f(L)$ and $LR_{\text{opt}} = g(L)$\.
8. Fit the loss\-compute scaling law: $L(C) = L_0 + \alpha C^{-\gamma}$\.
9. Stage 2: Hyperparameter Prediction \(Online\)
10. $L_{\text{init}} \leftarrow \text{Evaluate}(M_{\theta_0}, \mathcal{D}_{\text{cpt}})$ \{Initial validation loss on target data\}
11. $C_{\text{pre}} \leftarrow \left(\frac{\alpha}{L_{\text{init}} - L_0}\right)^{1/\gamma}$ \{Estimate equivalent pre\-training compute\}
12. $C_{\text{total}} \leftarrow C_{\text{pre}} + C_{\text{cpt}}$ \{Calculate total effective compute budget\}
13. $L_{\text{target}} \leftarrow L_0 + \alpha \cdot C_{\text{total}}^{-\gamma}$ \{Predict target loss at $C_{\text{total}}$\}
14. return $B_{\text{opt}} = f(L_{\text{target}})$, $LR_{\text{opt}} = g(L_{\text{target}})$
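The online stage of Algorithm 1 \(steps 10 to 14\) is short enough to sketch in Python\. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names are hypothetical, the fitted constants $L_0$, $\alpha$, $\gamma$ and the mappings $f$ and $g$ are taken as given from Stage 1, and the numbers in the usage lines are made up\.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class LossComputeLaw:
    """L(C) = l0 + alpha * C**(-gamma), assumed to be fitted offline in Stage 1."""
    l0: float
    alpha: float
    gamma: float

    def loss(self, compute: float) -> float:
        """Forward law: predicted loss at a given compute budget."""
        return self.l0 + self.alpha * compute ** (-self.gamma)

    def compute(self, loss: float) -> float:
        """Inverse law: compute needed to reach `loss` from scratch."""
        if loss <= self.l0:
            raise ValueError("loss must exceed the irreducible loss l0")
        return (self.alpha / (loss - self.l0)) ** (1.0 / self.gamma)


def predict_hyperparameters(
    law: LossComputeLaw,
    l_init: float,
    c_cpt: float,
    f: Callable[[float], float],
    g: Callable[[float], float],
) -> Tuple[float, float]:
    """Return (B_opt, LR_opt) for a checkpoint with validation loss l_init."""
    c_pre = law.compute(l_init)      # step 11: equivalent pre-training compute
    c_total = c_pre + c_cpt          # step 12: total effective compute
    l_target = law.loss(c_total)     # step 13: predicted loss at C_total
    return f(l_target), g(l_target)  # step 14: apply the Stage 1 mappings


# Hypothetical law constants and placeholder mappings, for demonstration only.
law = LossComputeLaw(l0=1.5, alpha=100.0, gamma=0.1)
toy_f = lambda l: 64.0 + 400.0 * (2.5 - l)  # placeholder for the fitted B_opt = f(L)
toy_g = lambda l: 1e-4 * l                  # placeholder for the fitted LR_opt = g(L)

b_opt, lr_opt = predict_hyperparameters(law, l_init=2.3, c_cpt=1e21, f=toy_f, g=toy_g)
print(f"B_opt = {b_opt:.1f}, LR_opt = {lr_opt:.2e}")
```

Passing $f$ and $g$ as callables keeps the sketch independent of the functional form chosen for the Stage 1 mappings, which this appendix does not specify\.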
