How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size

arXiv cs.LG 07/03/26, 04:00 AM Papers
scaling-laws batch-size training-steps large-language-models deep-learning compute-optimal token-allocation
Summary
Proposes a three-term scaling law that decouples model size, training steps, and batch size, enabling robust fitting with fewer runs and deriving scaling laws for suboptimal batch sizes.
arXiv:2607.01487v1 Announce Type: new Abstract: We propose a scaling law that takes into account model size and training data while explicitly splitting the latter into training steps and batch size (called three-term law). Fitting the proposed law on a large set of training runs, we find that it correctly recovers the scaling of the optimal batch size. Moreover, because it makes use of training runs with suboptimal batch size, our proposed law can be robustly fit with a significantly smaller amount of training runs. We further show that the three-term law can be used to derive scaling laws for suboptimal batch sizes, and that it matches previous empirical findings related to the critical batch size.
Cached at: 07/03/26, 05:41 AM
# How to Allocate Your Tokens? Scaling Laws with Training Steps and Batch Size
Source: [https://arxiv.org/html/2607.01487](https://arxiv.org/html/2607.01487)
Fabian Schaipp (Inria, École Normale Supérieure, PSL Research University, Paris). Corresponding email: fabian.schaipp@inria.fr

###### Abstract

We propose a scaling law that takes into account model size and training data while explicitly splitting the latter into training steps and batch size (called *three-term law*). Fitting the proposed law on a large set of training runs, we find that it correctly recovers the scaling of the optimal batch size. Moreover, because it makes use of training runs with suboptimal batch size, our proposed law can be robustly fit with a significantly smaller number of training runs. We further show that the three-term law can be used to derive scaling laws for suboptimal batch sizes, and that it matches previous empirical findings related to the critical batch size.

## 1 Introduction

The field of deep learning, and specifically large language models (LLMs), has seen enormous progress over the last few years. Much of this progress has been attributed to “simply” scaling up, both in terms of model size and data used for training. Improvements in model performance often follow predictable trends, called *scaling laws*, which have been found in the context of LLM pre-training (Kaplan et al., [2020](https://arxiv.org/html/2607.01487#bib.bib12); Hoffmann et al., [2022](https://arxiv.org/html/2607.01487#bib.bib10)), but also in many other areas of deep learning such as vision (Zhai et al., [2022](https://arxiv.org/html/2607.01487#bib.bib25)), weather forecasting (Bodnar et al., [2024](https://arxiv.org/html/2607.01487#bib.bib5)), or protein modeling (Lin et al., [2023](https://arxiv.org/html/2607.01487#bib.bib16)).

##### Scaling laws for model size and data\.

In the context of training large language models, scaling laws classically refer to functional forms that allow one to infer the optimal allocation of compute between model size $N$ and training examples $D$. Seminal works by Kaplan et al. ([2020](https://arxiv.org/html/2607.01487#bib.bib12)) and Hoffmann et al. ([2022](https://arxiv.org/html/2607.01487#bib.bib10)) show that the test loss predictably decreases when increasing $N$ or $D$. A widely used technique to obtain a scaling law, known as Chinchilla Approach 3, is to model the (final test) loss as a sum of power laws in $N$ and $D$, i.e.

$$\mathcal{L}(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}.\tag{1}$$

The parameters $(E,A,B,\alpha,\beta)$ are fitted from a set of training runs. One can then derive the optimal model size from ([1](https://arxiv.org/html/2607.01487#S1.E1)) for a given compute constraint, and potentially extrapolate this to values of $(N,D)$ outside of the experimentally tested ranges. On the other hand, several works show that the fitting procedure itself is delicate, and that design choices in both the training procedure and the fitting technique can impact the result (Besiroglu et al., [2024](https://arxiv.org/html/2607.01487#bib.bib3); Li et al., [2025b](https://arxiv.org/html/2607.01487#bib.bib15)). Further, it is not guaranteed that the law generalizes well, even though good generalization is crucial for deriving optimal configurations for large-scale training runs.

##### Hyperparameter scaling laws\.

While scaling laws of the Chinchilla form can inform compute-optimal allocation of $N$ and $D$, they do not explicitly account for training hyperparameters. To address this, several recent works derive scaling laws for the optimal learning rate and/or batch size as a function of $(N,D)$ (DeepSeek-AI et al., [2024](https://arxiv.org/html/2607.01487#bib.bib7); Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14); von Rütte et al., [2026](https://arxiv.org/html/2607.01487#bib.bib19)). Typically, these works simply assume a power-law relationship for the hyperparameter of interest (e.g. the optimal batch size as a function of $D$) and fit the coefficients to data from training runs. However, these approaches are not directly compatible with laws of form ([1](https://arxiv.org/html/2607.01487#S1.E1)) as they do not model the loss value.

A different line of work is based on the concept of the *critical batch size* (McCandlish et al., [2018](https://arxiv.org/html/2607.01487#bib.bib17); Shallue et al., [2018](https://arxiv.org/html/2607.01487#bib.bib29)): it describes the phenomenon that the number of steps $K$ required to reach a target loss, as a function of the batch size, will at some point decrease much more slowly than inversely proportionally. McCandlish et al. ([2018](https://arxiv.org/html/2607.01487#bib.bib17)) model this with the equation

$$(K/K_{\min}-1)(D/D_{\min}-1)=1.\tag{2}$$

Here, $K_{\min}$ (resp. $D_{\min}$) is the minimum number of steps (resp. number of tokens) required to reach the target loss, and can be fit empirically. The critical batch size is then defined as $D_{\min}/K_{\min}$. Bergsma et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib2)) establish a connection to weight decay. In an earlier work, Kaplan et al. ([2020](https://arxiv.org/html/2607.01487#bib.bib12)) use the same model of critical batch size to relate the loss to the number of training steps.

A central issue with ([2](https://arxiv.org/html/2607.01487#S1.E2)) is that it implies an *optimal batch size of one* (Bergsma et al., [2025](https://arxiv.org/html/2607.01487#bib.bib2)); however, this conflicts with the empirical situation when using AdamW, where the optimal batch size has been found to scale with the available token budget $D$ (Porian et al., [2024](https://arxiv.org/html/2607.01487#bib.bib18); Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14)).
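To make this implication concrete, the following minimal sketch solves the critical-batch-size equation (2) together with the constraint $D=MK$ for a given batch size $M$. The constants `K_MIN` and `D_MIN` are hypothetical and chosen only for illustration; they are not fitted values from any of the cited works.

```python
def steps_and_tokens(batch_size, k_min, d_min):
    """Solve (K/K_min - 1)(D/D_min - 1) = 1 together with D = batch_size * K.

    Substituting D = M * K yields K = K_min + D_min / M and D = D_min + M * K_min.
    """
    steps = k_min + d_min / batch_size
    tokens = batch_size * steps
    return steps, tokens


# Hypothetical constants, chosen only for illustration.
K_MIN = 1_000.0   # minimum number of steps to reach the target loss
D_MIN = 1.0e9     # minimum number of tokens to reach the target loss
CRITICAL_BATCH_SIZE = D_MIN / K_MIN

for m in (1.0e4, 1.0e5, 1.0e6, 1.0e7):
    k, d = steps_and_tokens(m, K_MIN, D_MIN)
    print(f"M={m:.0e}  K={k:.0f}  D={d:.3e}")
```

Under this model the required token count $D=D_{\min}+M K_{\min}$ grows with every increase of $M$, so the token-optimal batch size is always the smallest admissible one.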

##### Scaling laws and optimization theory\.

From a theoretical point of view, the relationship between loss, batch size and number of training steps is inherently related to optimization theory. Recent works by Shulgin et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib23)) and Islamov et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib11)) derive hyperparameter scaling laws directly from convergence bounds for stochastic conditional gradient methods. Moreover, it has been shown previously that hyperparameter tuning/transfer for LLM training can be informed by optimization theory, for example in the context of learning-rate schedules (Schaipp et al., [2025](https://arxiv.org/html/2607.01487#bib.bib21)) or of weight decay (Wang and Aitchison, [2025](https://arxiv.org/html/2607.01487#bib.bib24)).

## 2 Overview

##### Our proposed laws\.

Here, we propose to model the loss as a power-law function of $(N,M,K)$, where $M$ is the batch size in tokens and $K$ is the number of training steps, that is,

$$\mathcal{L}(N,M,K)=E+\frac{A}{N^{\alpha}}+\frac{B}{M^{\beta}}+\frac{C}{K^{\gamma}}.$$

This has several natural advantages:

1. (i) Our law brings together the Chinchilla form ([1](https://arxiv.org/html/2607.01487#S1.E1)) with scaling laws for the optimal batch size. Under a constrained data budget $D=MK$, our law implies a scaling rule for the optimal batch size with $D$, while at the same time collapsing to a Chinchilla form when using the optimal batch size.
2. (ii) The proposed form is also closely connected to and inspired by theoretical results from stochastic optimization, see Shulgin et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib23)); Kovalev ([2025](https://arxiv.org/html/2607.01487#bib.bib13)). This allows us to bridge from empirical scaling analysis to a more theoretical understanding.
3. (iii) The proposed law can be fit with runs from suboptimal batch sizes, which (as we will show) drastically reduces the number of training runs needed for fitting.
4. (iv) While previous scaling laws only model the performance with *optimal* hyperparameters, our formulation also describes performance in the suboptimal batch size regime. This can be important in practice when facing hardware constraints.

##### Summary of our findings\.

We fit our proposed laws on two datasets of training runs of (dense) LLMs, from here on referred to as Li (Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14)) and OpenEuroLLM ([OpenEuroLLM Consortium,](https://arxiv.org/html/2607.01487#bib.bib30)). Both datasets cover multiple model sizes, token budgets and batch sizes. (Footnote 3: The OpenEuroLLM dataset is not yet public at the time of writing. We will make our codebase for reproducing all experiments public once the OpenEuroLLM dataset has been released.)

1. (i) Our law results in an implied optimal batch size scaling that is consistent with previous hyperparameter scaling laws that do not model the loss value (see [Section 4.1](https://arxiv.org/html/2607.01487#S4.SS1)). In particular, we find that with our formulation two batch sizes per $(N,D)$ suffice to robustly find this law (instead of doing a full sweep); this reduces the number of training runs needed to 28% of the full sweep (see [Section 4.2](https://arxiv.org/html/2607.01487#S4.SS2)).
2. (ii) By construction, our proposed law results in non-trivial optimal batch sizes (in contrast to previous formulations, as mentioned above) that are independent of model size, matching the empirical results of Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)). At the same time, it describes how the critical batch size scales with $N$ and/or $D$, consistently with findings by Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26)) (see [Section 4.4](https://arxiv.org/html/2607.01487#S4.SS4)).
3. (iii) For situations where the optimal batch size might be infeasible due to practical constraints, we derive scaling laws for $\varepsilon$-suboptimal batch sizes that generalize well to out-of-sample token budgets (see [Section 4.3](https://arxiv.org/html/2607.01487#S4.SS3)).

##### Notation\.

The table below summarizes the most important notation used throughout the document. Note that $D=MK$ holds.

## 3 Scaling Laws with Training Steps and Batch Size

Recall the Chinchilla law proposed by Hoffmann et al. ([2022](https://arxiv.org/html/2607.01487#bib.bib10)): for $A,B,E>0$ and $\alpha,\beta>0$, let the loss be parametrized as

$$\mathcal{L}(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}.$$

Here, $\mathcal{L}$ usually refers to the test loss at the end of training (or a smoothed version of it). Using the training runs from Hoffmann et al. ([2022](https://arxiv.org/html/2607.01487#bib.bib10)), but with a more precise fitting procedure, Besiroglu et al. ([2024](https://arxiv.org/html/2607.01487#bib.bib3)) report the law

$$\mathcal{L}(N,D)=1.8172+\frac{482.01}{N^{0.3478}}+\frac{2085.43}{D^{0.3658}}.\tag{EpochAI}$$

The above scaling law does *not* take into account the batch size, which however plays a crucial role for training performance (Bergsma et al., [2025](https://arxiv.org/html/2607.01487#bib.bib2); Zhang et al., [2025](https://arxiv.org/html/2607.01487#bib.bib26)). In particular, Hoffmann et al. ([2022](https://arxiv.org/html/2607.01487#bib.bib10)) report only that they used “well-tested heuristics” for the choice of batch size, but they do not study its effect on the scaling analysis.
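As an illustration of how a law of this form is used, the sketch below evaluates the (EpochAI) law and computes the allocation of a compute budget into $N$ and $D$ that minimizes the predicted loss. The cost model of 6 FLOPs per parameter and token is a common approximation and an assumption here; it is not stated in this paper. The function names are likewise hypothetical.

```python
# Coefficients of the (EpochAI) law as reported by Besiroglu et al. (2024).
E, A, ALPHA = 1.8172, 482.01, 0.3478
B, BETA = 2085.43, 0.3658


def chinchilla_loss(n_params, n_tokens):
    """Predicted loss for model size N and token budget D."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal_allocation(flops, flops_per_param_token=6.0):
    """Minimize the loss subject to flops = 6 * N * D (an assumed cost model)."""
    budget = flops / flops_per_param_token  # the product N * D
    # Setting the derivative of A/N^alpha + B/(budget/N)^beta to zero gives N*.
    scale = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = scale * budget ** (BETA / (ALPHA + BETA))
    return n_opt, budget / n_opt


n_opt, d_opt = compute_optimal_allocation(1.0e21)
print(f"N*={n_opt:.3e}  D*={d_opt:.3e}  loss={chinchilla_loss(n_opt, d_opt):.4f}")
```

Note that nothing in this calculation depends on how the $D$ tokens are split into batch size and training steps, which is the gap the following sections address.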

Here, we propose to take into account how the token budget $D$ is allocated into training steps $K$ and batch size $b$. We first describe two similar approaches to do so.

##### Approach I: A three\-term law\.

Following the power-law approach, we propose the functional form

$$\mathcal{L}(N,M,K)=E+\frac{A}{N^{\alpha}}+\frac{B}{M^{\beta}}+\frac{C}{K^{\gamma}},\tag{3}$$

where $(E,A,B,C,\alpha,\beta,\gamma)$ are fittable parameters. This has the advantage that, for the optimal batch size choice, it reduces automatically to the original Chinchilla form ([1](https://arxiv.org/html/2607.01487#S1.E1)): to see this, minimize ([3](https://arxiv.org/html/2607.01487#S3.E3)) with respect to $M$, subject to $D=MK$. The optimal batch size is

$$M^{\star}=\Big[\frac{\beta B}{\gamma C}\Big]^{\frac{1}{\beta+\gamma}}D^{\frac{\gamma}{\beta+\gamma}}=:GD^{\frac{\gamma}{\beta+\gamma}}.\tag{4}$$

In particular, the optimal batch size is independent of the model size $N$. Plugging $M^{\star}$ and $K^{\star}=D/M^{\star}$ back into ([3](https://arxiv.org/html/2607.01487#S3.E3)) gives

$$\mathcal{L}(N,D)=E+\frac{A}{N^{\alpha}}+\frac{\hat{B}}{D^{\tau}},\quad\tau:=\frac{\beta\gamma}{\beta+\gamma},\quad\hat{B}:=BG^{-\beta}+CG^{\gamma}.\tag{5}$$

Hence, under ([3](https://arxiv.org/html/2607.01487#S3.E3)), the optimal way to split a fixed token budget $D$ into $(K,b)$ recovers the form ([1](https://arxiv.org/html/2607.01487#S1.E1)) by re-parameterizing $(B,\beta)\to(\hat{B},\tau)$.
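The closed forms (4) and (5) can be checked numerically. The sketch below compares the closed-form minimizer with a brute-force search over a grid of batch sizes, and the collapsed data term $\hat{B}/D^{\tau}$ with the two batch-size-dependent terms of (3) evaluated at $M^{\star}$. All coefficient values are hypothetical and only serve as an example; they are not fitted values from the paper.

```python
import math


def optimal_batch_size(d, b_coef, c_coef, beta, gamma):
    """Closed-form minimizer from (4)."""
    g = (beta * b_coef / (gamma * c_coef)) ** (1.0 / (beta + gamma))
    return g * d ** (gamma / (beta + gamma))


def data_terms(m, d, b_coef, c_coef, beta, gamma):
    """The two batch-size-dependent terms of (3) with K = D / M."""
    return b_coef / m**beta + c_coef / (d / m) ** gamma


def collapsed_data_term(d, b_coef, c_coef, beta, gamma):
    """The term B_hat / D**tau from (5)."""
    g = (beta * b_coef / (gamma * c_coef)) ** (1.0 / (beta + gamma))
    tau = beta * gamma / (beta + gamma)
    b_hat = b_coef * g ** (-beta) + c_coef * g**gamma
    return b_hat / d**tau


# Hypothetical coefficients and token budget, for illustration only.
COEFS = dict(b_coef=50.0, c_coef=20.0, beta=0.3, gamma=0.4)
D = 1.0e10

m_star = optimal_batch_size(D, **COEFS)
# Brute-force check on a logarithmic grid of batch sizes centred at m_star.
grid = [m_star * 2.0 ** (i / 10.0) for i in range(-50, 51)]
m_grid = min(grid, key=lambda m: data_terms(m, D, **COEFS))
print(m_star, m_grid)
print(data_terms(m_star, D, **COEFS), collapsed_data_term(D, **COEFS))
```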

Equation ([3](https://arxiv.org/html/2607.01487#S3.E3)) is structurally very similar to convergence bounds from stochastic optimization. Let $(x_{k})_{k\in\mathbb{N}}$ be the iterates of the stochastic conditional gradient method with respect to a general norm. Assuming the $\mu$-KL condition for the loss function (see Islamov et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib11))), for fixed batch size $M$ (or $b$) and training steps $K$, under the optimal learning rate $\eta^{\star}$, Shulgin et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib23), Thm. 1) (derived from Kovalev ([2025](https://arxiv.org/html/2607.01487#bib.bib13))) state that

$$\min_{1\leq k\leq K}\mathbb{E}[\mathcal{L}(x_{k})]\lesssim\mathcal{L}_{\star}+\frac{1}{\sqrt{M}}+\frac{1}{\sqrt{K}},$$

where $\lesssim$ denotes that the bound holds up to some multiplicative constant for each term on the right-hand side. Comparing to ([3](https://arxiv.org/html/2607.01487#S3.E3)), the optimal loss is parametrized by $E+\frac{A}{N^{\alpha}}$, and for the other terms the powers are relaxed from $1/2$ to $(\beta,\gamma)$.
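As a worked example, consider what the formulas (4) and (5) give in the unrelaxed case $\beta=\gamma=1/2$. This treats the upper bound as if it were tight and writes $B$ and $C$ for its two multiplicative constants, both of which are assumptions made only for this illustration:

$$G=\Big[\frac{\tfrac12 B}{\tfrac12 C}\Big]^{\frac{1}{1/2+1/2}}=\frac{B}{C},\qquad M^{\star}=\frac{B}{C}\,D^{1/2},\qquad \tau=\frac{\tfrac12\cdot\tfrac12}{\tfrac12+\tfrac12}=\frac14,\qquad \hat{B}=BG^{-1/2}+CG^{1/2}=2\sqrt{BC}.$$

Under this reading, the optimal batch size would grow like $\sqrt{D}$ and the data-dependent part of the loss would decay like $D^{-1/4}$. Fitting $(\beta,\gamma)$ lets the data move both exponents away from these values.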

##### Approach II: Model\-specific two\-term laws\.

The functional form ([3](https://arxiv.org/html/2607.01487#S3.E3)) implicitly assumes that the model size does not impact the coefficients $(B,C)$ and $(\beta,\gamma)$, and therefore by construction $M^{\star}$ is independent of $N$. While this implicit assumption is supported by previous experimental results on LLMs (Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14); Zhang et al., [2025](https://arxiv.org/html/2607.01487#bib.bib26)), as an alternative, we can fix the model size and fit the functional form

$$\mathcal{L}(M,K)=E+\frac{B}{M^{\beta}}+\frac{C}{K^{\gamma}},\tag{6}$$

where again $(E,B,C,\beta,\gamma)$ are fittable parameters. We can fit this form (independently) for multiple $N$, a priori allowing for different values of $(B,C,\beta,\gamma)$ across model sizes.

##### Terminology\.

From now on, we refer to laws of the form ([6](https://arxiv.org/html/2607.01487#S3.E6)) as *two-term laws*, and to laws of the form ([3](https://arxiv.org/html/2607.01487#S3.E3)) as *three-term laws*. When convenient, we will use the abbreviations 2TL and 3TL, respectively.

## 4 Experiments

We fit scaling laws of the form 2TL and 3TL on two datasets which contain training logs for multiple combinations of $(N,M,K)$ using AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2607.01487#bib.bib33)), and which include a learning-rate sweep for each combination. We refer to the two datasets as Li and OpenEuroLLM; details are described in [Section A.1](https://arxiv.org/html/2607.01487#A1.SS1) in the Appendix, in particular see [Table 4](https://arxiv.org/html/2607.01487#A1.T4). Here, we focus on the Li dataset; the results for OpenEuroLLM are deferred to the Appendix. For each $(N,M,K)$, we choose the smallest final smoothed test loss value across learning rates. We form a validation set by collecting the datapoints from the largest token budget $D$ for each individual $N$; the remaining datapoints are used as the training set for fitting the law. We usually mark points from the validation set with a dashed border in our plots. See [Section A.1](https://arxiv.org/html/2607.01487#A1.SS1) for additional details and [Figs. 13](https://arxiv.org/html/2607.01487#A1.F13) and [15](https://arxiv.org/html/2607.01487#A1.F15) for an overview of the Li dataset.
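The data preparation described above can be sketched as follows. The record layout (a list of dictionaries with keys `N`, `M`, `K`, `lr`, `loss`) and the example values are hypothetical; the actual datasets and their format are described in the Appendix.

```python
from collections import defaultdict


def best_over_learning_rates(runs):
    """Keep the smallest final loss for each (N, M, K) across learning rates."""
    best = {}
    for run in runs:
        key = (run["N"], run["M"], run["K"])
        if key not in best or run["loss"] < best[key]:
            best[key] = run["loss"]
    return best


def split_by_largest_budget(best):
    """Validation = points with the largest D = M * K for each N; rest = train."""
    max_budget = defaultdict(int)
    for n, m, k in best:
        max_budget[n] = max(max_budget[n], m * k)
    train, val = {}, {}
    for (n, m, k), loss in best.items():
        target = val if m * k == max_budget[n] else train
        target[(n, m, k)] = loss
    return train, val


# Hypothetical runs: one model size, three (M, K) settings, two learning rates.
example_runs = [
    {"N": 1, "M": 64, "K": 100, "lr": 1e-3, "loss": 3.2},
    {"N": 1, "M": 64, "K": 100, "lr": 3e-3, "loss": 3.1},
    {"N": 1, "M": 128, "K": 50, "lr": 1e-3, "loss": 3.3},
    {"N": 1, "M": 128, "K": 100, "lr": 1e-3, "loss": 3.0},
]
example_best = best_over_learning_rates(example_runs)
example_train, example_val = split_by_largest_budget(example_best)
print(example_train, example_val)
```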

##### Fitting procedure\.

If not explicitly mentioned otherwise, we use a standard fitting procedure minimizing the Huber loss with L-BFGS-B from multiple initializations, and five-fold cross-validation for each law. For the Huber loss we use $\delta=10^{-3}$; for a detailed description and additional ablations on these choices, we refer to [Section A.1](https://arxiv.org/html/2607.01487#A1.SS1) and references therein. All fitted parameter values can be found in [Section A.3](https://arxiv.org/html/2607.01487#A1.SS3) in the Appendix. Note that whenever we report a single number for a parameter, it is the average over the five cross-validation runs.
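For orientation, a minimal sketch of such an objective for the three-term law is given below. Applying the Huber loss to residuals between the logarithms of predicted and observed loss follows the convention of Hoffmann et al. (2022) and is an assumption here, as is the direct parametrization; the exact setup is described in Section A.1. The optimizer itself (L-BFGS-B with multiple initializations) is not shown.

```python
import math


def huber(residual, delta=1e-3):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual**2
    return delta * (magnitude - 0.5 * delta)


def three_term_law(params, n, m, k):
    """Evaluate E + A/N^alpha + B/M^beta + C/K^gamma."""
    e, a, b, c, alpha, beta, gamma = params
    return e + a / n**alpha + b / m**beta + c / k**gamma


def objective(params, data, delta=1e-3):
    """Sum of Huber losses on log-residuals (an assumed convention)."""
    return sum(
        huber(math.log(three_term_law(params, n, m, k)) - math.log(loss), delta)
        for n, m, k, loss in data
    )
```

A function like `objective` would be passed to a numerical optimizer and minimized from several starting points, keeping the solution with the smallest objective value.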

### 4.1 Comparison of Approach I and II

A priori, it is not clear which of the two approaches described in [Section 3](https://arxiv.org/html/2607.01487#S3) is preferable. Before doing any experimental comparison, we want to mention the main differences between 2TL and 3TL that arise from their definition. First, for 3TL we will usually have more datapoints per (fittable) parameter, as we can use training runs across all model sizes $N$. For example, for the Li dataset, we have 246 samples to fit seven parameters for 3TL, while for 2TL we have 13 to 60 samples to fit five parameters. Consequently, we expect that the in-sample fitting error of 2TL will be smaller. Second, based on the results of Li et al. ([2025b](https://arxiv.org/html/2607.01487#bib.bib15)), we would expect that 3TL is more delicate/unstable to fit, given that it has two more fittable parameters.

We compare both approaches with respect to the following evaluations: (i) quality of fit, (ii) consistency of the scaling coefficients $(\beta,\gamma)$ and (iii) consistency of the resulting optimal batch size $M^{\star}$ from ([4](https://arxiv.org/html/2607.01487#S3.E4)) with previously reported results. We fit a 2TL for each model size $N$ (using only the respective subset of datapoints), and a single 3TL using the union of all datapoints (across $N$).

For (i), we compute the mean absolute deviation (MAD) for the 3TL as well as for the 2TL, where each 2TL is evaluated on the subset of datapoints belonging to its model size. We then compute the average MAD across all 2TL, weighted by the sample size, which we call the *two-term law ensemble*. This is done separately for training and validation set (see [Fig. 2](https://arxiv.org/html/2607.01487#S4.F2)). For (ii) and (iii), we plot the estimated values of $(\beta,\gamma)$ as well as the resulting formula for $M^{\star}$ from ([4](https://arxiv.org/html/2607.01487#S3.E4)). We can directly compare the latter to the scaling law for optimal batch size reported by Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)), which was obtained from the same dataset and which we refer to as Step-Law (see [Fig. 1](https://arxiv.org/html/2607.01487#S4.F1)).
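The quality-of-fit metric for (i) can be sketched as below. The function names and example numbers are hypothetical; the sketch only illustrates the sample-size weighting of the two-term law ensemble.

```python
def mean_absolute_deviation(predicted, observed):
    """Average absolute difference between predicted and observed losses."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def ensemble_mad(per_model):
    """Sample-size-weighted average MAD over one 2TL per model size.

    `per_model` is a list of (predicted, observed) pairs, one per model size N.
    """
    total = sum(len(observed) for _, observed in per_model)
    weighted = sum(
        len(observed) * mean_absolute_deviation(predicted, observed)
        for predicted, observed in per_model
    )
    return weighted / total


# Hypothetical predictions and observations for two model sizes.
example = [
    ([3.0, 3.2], [3.1, 3.1]),
    ([2.5, 2.6, 2.7, 2.8], [2.5, 2.5, 2.7, 2.9]),
]
print(ensemble_mad(example))
```

Weighting by sample size makes the ensemble MAD equal to the MAD over the pooled datapoints, which keeps it directly comparable to the MAD of the single 3TL.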

Discussion: As expected, [Fig. 2](https://arxiv.org/html/2607.01487#S4.F2) shows that the in-sample fit of 2TL is better; however, the out-of-sample error of 3TL is slightly smaller. This indicates that despite having two additional parameters in 3TL, our fitting procedure is sufficiently robust thanks to multiple initializations and bootstrap aggregation. Moreover, [Fig. 1](https://arxiv.org/html/2607.01487#S4.F1) shows that the three-term law approach is also preferable in terms of consistency of the implied scaling of $M^{\star}$. In particular, its resulting scaling of $M^{\star}$ almost perfectly overlaps with the one by Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)).

Takeaway: The three-term law approach achieves an overall slightly better out-of-sample fit than the two-term laws. Its implied scaling for the optimal batch size $M^{\star}\sim D$ is consistent with previous analyses (see [Table 1](https://arxiv.org/html/2607.01487#S4.T1)).

##### Consistency across datasets.

We run the same analysis on the OpenEuroLLM dataset, see [Section A.4](https://arxiv.org/html/2607.01487#A1.SS4) in the Appendix. On OpenEuroLLM, both approaches again lead to a good fit. The scaling of the optimal batch size is relatively similar to the Li dataset, albeit slightly steeper. However, we also observe the caveat described below.

Caveat: The fitted parameter values for $(E,A,\alpha)$ are quite different between Li and OpenEuroLLM; in particular, the fit for Li is $E\approx 0$, indicating no irreducible loss, which conflicts with the non-zero entropy of language. This suggests that the effect of the model size is not perfectly reflected in 3TL.

We hypothesize that the scaling inconsistencies between the two datasets are due to different training setups. For example, OpenEuroLLM uses different $\beta_{2}$ values in Adam for the larger models, which is known to have an impact on scaling laws (Porian et al., [2024](https://arxiv.org/html/2607.01487#bib.bib18)). This might also explain why the 2TL has a better out-of-sample error than 3TL for OpenEuroLLM, as 2TL has more flexibility with respect to the impact of $N$. We also remark that imposing a small regularization on $\log(E)$ can alleviate the above caveat (see [Section A.2](https://arxiv.org/html/2607.01487#A1.SS2)).

![Refer to caption](https://arxiv.org/html/2607.01487v1/x1.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x2.png)

Figure 1: (Left) Estimates for the $M^{\star}$-scaling coefficient $\frac{\gamma}{\beta+\gamma}$ for 3TL and each 2TL. Shaded area depicts min and max over five cross-validation fits. (Right) Implied scaling of $M^{\star}$ according to ([4](https://arxiv.org/html/2607.01487#S3.E4)). Shaded area depicts min and max over cross-validation. Dots show the empirically best batch size from the train (black) and validation split (blue).

![Refer to caption](https://arxiv.org/html/2607.01487v1/x3.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x4.png)

Figure 2: MAD comparison of 2TL and 3TL on the train (left) and validation (right) split.

Table 1: Comparison of 3TL to batch size scaling laws from the literature.

| Reference | Scaling | Comment |
| --- | --- | --- |
| (Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14)) | $M^{\star}=0.58\cdot D^{0.571}$ | Referred to as Step-Law |
| (Bergsma et al., [2025](https://arxiv.org/html/2607.01487#bib.bib2)) | $M^{\star}=(0.0306\cdot s)\cdot D^{0.383}$ | |
| (DeepSeek-AI et al., [2024](https://arxiv.org/html/2607.01487#bib.bib7)) | $M^{\star}=0.086\cdot D^{0.688}$ | From $M^{\star}\propto C^{0.327}$, $D^{\star}\propto C^{0.475}$ |
| (3TL, Li dataset) (ours) | $M^{\star}=0.667\cdot D^{0.566}$ | From ([4](https://arxiv.org/html/2607.01487#S3.E4)) |

### 4.2 Compute Savings Using the Three-term Law

Fitting a scaling law for $M^{\star}$ with the approach of Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)) imposes massive computational costs, as it requires obtaining the optimal batch size for a set of different token budgets $D$ (and possibly also varying the model size $N$). Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)) report that producing their entire set of training runs consumed nearly one million NVIDIA H800 GPU hours. Recall that Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)) fit a law of the form

$$M^{\star}=\tilde{A}\,D^{\tilde{\alpha}}\tag{7}$$

directly on observations of $(M^{\star},D)$. For a single observation $(M^{\star},D)$, a full batch size sweep is needed to determine $M^{\star}$ (in the Li dataset, concretely we have 5 to 10 batch sizes per sweep). In contrast, our three-term law makes explicit use of observations from *suboptimal* batch sizes; we will show that this allows us to obtain the same scaling law for $M^{\star}$ while saving a substantial number of training runs, and thus compute.

Setup: We mask the original dataset (containing the full batch size sweep), such that for each combination of $(N,D)$ only {one/two/three} batch sizes are randomly selected (see [Fig. 14](https://arxiv.org/html/2607.01487#A1.F14) for an illustration). For the Li dataset, this shrinks the number of training runs required/available for the fit to {14/28/42} percent (see [Table 2](https://arxiv.org/html/2607.01487#S4.T2)). We then fit the three-term law on this reduced dataset. As comparison, we fit ([7](https://arxiv.org/html/2607.01487#S4.E7)) with $M^{\star}$ being the batch size with the best loss (after applying the mask as described above), for each $(N,D)$ separately. (Footnote 4: This is similar to the procedure described by Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)). Alternatively, one could first fit a quadratic (or similar function) to the batch size sweep, and then read off $M^{\star}$ as the minimum. However, this is infeasible/unstable with $\leq 3$ points available. We tried deriving $M^{\star}$ from a quadratic fit when having 4 points per sweep, but this does not improve the result, rather the opposite: the resulting scaling is $0.02\cdot D^{0.738}$.) We also use five-fold cross-validation to fit ([7](https://arxiv.org/html/2607.01487#S4.E7)).
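The masking step can be sketched as follows. The tuple layout `(N, D, M, loss)`, the function name, the seed handling, and the example values are assumptions made for illustration; the authors' actual implementation is not shown in the text.

```python
import random
from collections import defaultdict


def mask_sweeps(runs, keep, seed=0):
    """Keep at most `keep` randomly chosen batch sizes for each (N, D) pair."""
    rng = random.Random(seed)
    sweeps = defaultdict(list)
    for n, d, m, loss in runs:
        sweeps[(n, d)].append((n, d, m, loss))
    reduced = []
    for key in sorted(sweeps):
        sweep = sweeps[key]
        reduced.extend(rng.sample(sweep, min(keep, len(sweep))))
    return reduced


# Hypothetical dataset: one model size, two token budgets, five batch sizes each.
example = [
    (268e6, d, m, 3.0)
    for d in (1e9, 2e9)
    for m in (2**15, 2**16, 2**17, 2**18, 2**19)
]
print(len(mask_sweeps(example, keep=2)))  # 2 sweeps x 2 batch sizes = 4 runs
```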

Discussion: We make two main observations: (i) Already with the full batch size sweep, ([7](https://arxiv.org/html/2607.01487#S4.E7)) is sensitive to removing the validation split. A direct fit on the full dataset (train and val) with our code gives $M^{\star}=0.47\cdot D^{0.584}$, essentially the same as Step-Law. Removing the validation split, we already get a quite different scaling of $M^{\star}=6.29\cdot D^{0.468}$. (ii) The three-term law results in almost identical scaling for $M^{\star}$, even when reducing the batch size sweep to two values, hence reducing the number of required training runs to 28%. Directly fitting ([7](https://arxiv.org/html/2607.01487#S4.E7)), in contrast, is highly unstable in this regime, and generalizes badly to higher (unseen) token budgets (see [Fig. 3](https://arxiv.org/html/2607.01487#S4.F3)). When masking to only one batch size per sweep, the results of both approaches differ substantially from the original law.
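To get a feeling for how differently the fitted laws extrapolate, the sketch below evaluates three of the laws reported in Table 2 at a few token budgets and prints their ratio to Step-Law. The token budgets are illustrative choices and are not taken from the dataset.

```python
# (coefficient, exponent) pairs of M* = coefficient * D**exponent, from Table 2.
LAWS = {
    "Step-Law": (0.58, 0.571),
    "3TL, 2 values per sweep": (0.84, 0.555),
    "direct fit, 2 values per sweep": (2852.95, 0.210),
}


def optimal_batch_size(law, tokens):
    """Evaluate the named power law at a given token budget."""
    coefficient, exponent = LAWS[law]
    return coefficient * tokens**exponent


for tokens in (1.0e9, 1.0e10, 1.0e11):  # illustrative budgets, not from the dataset
    reference = optimal_batch_size("Step-Law", tokens)
    for law in LAWS:
        ratio = optimal_batch_size(law, tokens) / reference
        print(f"D={tokens:.0e}  {law}: ratio to Step-Law = {ratio:.2f}")
```

Because the ratio of two power laws changes by a constant factor per decade of $D$, determined by the difference of their exponents, the direct fit with exponent 0.210 drifts away from Step-Law much faster than the 3TL fit with exponent 0.555.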

| Sweep size for $b$ | 3TL | Direct fit ([7](https://arxiv.org/html/2607.01487#S4.E7)) | Samples/Training runs |
| --- | --- | --- | --- |
| Full | $M^{\star}=0.67\cdot D^{0.566}$ | $M^{\star}=6.29\cdot D^{0.468}$ | 246 |
| 3 values | $M^{\star}=0.48\cdot D^{0.580}$ | $M^{\star}=8.59\cdot D^{0.455}$ | 102 |
| 2 values | $M^{\star}=0.84\cdot D^{0.555}$ | $M^{\star}=2852.95\cdot D^{0.210}$ | 68 |
| 1 value | $M^{\star}=5.92\cdot D^{0.475}$ | $M^{\star}=3.61\cdot D^{0.514}$ | 34 |

Step-Law (Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14)): $M^{\star}=0.58\cdot D^{0.571}$

Table 2: Laws for $M^{\star}\sim D$ using 3TL vs. a direct fit, for different masked versions of the Li dataset. When having only two or three runs/batch sizes per $(N,D)$, 3TL still results in essentially the same law, whereas the direct fit deviates. Note that the direct fit only uses the train split, which explains the difference to the Step-Law.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x5.png)
(a) Dataset reduced to 42%

![Refer to caption](https://arxiv.org/html/2607.01487v1/x6.png)
(b) Dataset reduced to 28%

Figure 3: Fitting on a reduced dataset, with only 3 values of $b$ per sweep (left) and 2 values (right). Step-Law can be considered the oracle law here, as it was fit on the unreduced dataset (train plus val). In both cases, the implied scaling of $M^{\star}$ of the three-term law stays close to Step-Law, and generalizes better to large $D$ than the direct fit (in *red*). The gray dots mark the empirically best batch size for each $(N,D)$ on the reduced dataset (for the train split).

### 4.3 Performance with Suboptimal Batch Sizes

In practice, understanding how the *optimal* batch size scales with $N$ and $D$ might not be enough. In case of hardware constraints, it is necessary to model the performance of models trained with a *suboptimal* batch size. The three-term law form has the evident appeal that it also predicts model performance for suboptimal allocation of tokens into steps and batch size. (Footnote 5: This is similar to how the Chinchilla Approach 3 has the advantage that it describes suboptimal allocation of compute into model size or token budget.) In short, the goal of this section is to answer the following question:

> What is the interval of suboptimal batch sizes $[b_{\min},b_{\max}]$ such that at most 5% of compute is wasted, and how does it scale with $D$?

##### Limitations of the three\-term law\.

We first evaluate the fitted 3TL on a range of batch sizes, and compare the predicted loss values to the true ones (see [Fig. 4](https://arxiv.org/html/2607.01487#S4.F4)). While the optimal batch size is predicted well across all token budgets ([Fig. 4](https://arxiv.org/html/2607.01487#S4.F4), left), the three-term law fails to accurately predict the loss value at the boundaries of the empirically covered range of $D$ ([Fig. 4](https://arxiv.org/html/2607.01487#S4.F4), right). This is not surprising: note that for the three-term law, we fit 246 data points with seven parameters; therefore, we cannot expect a perfect fit.

inline,bordercolor=caveatcolor,backgroundcolor=caveatcolor\!20,linecolor=caveatcolorinline,bordercolor=caveatcolor,backgroundcolor=caveatcolor\!20,linecolor=caveatcolortodo:inline,bordercolor=caveatcolor,backgroundcolor=caveatcolor\!20,linecolor=caveatcolorCaveat:Due to its underparametrization, the three\-term law can not fit loss values accurately enough to robustly infer performance at suboptimal batch sizes\.![Refer to caption](https://arxiv.org/html/2607.01487v1/x7.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x8.png)

Figure 4: $N=268$M. (Left) While the three-term law ([3](https://arxiv.org/html/2607.01487#S3.E3)) accurately predicts the optimal batch size, its predicted *loss value* for very large/small token budgets deviates from the empirical value. (Right) Empirical and predicted loss value across batch size $b$. Again, for very large/small token budgets the accuracy of the three-term law degrades. Dashed border marks datapoints not used for fitting 3TL.

##### Fitting in two stages.

To improve the fitting quality, we can fit a functional form $\mathcal{L} \sim b$ only on a subset of the data. We use the three-term law ([3](https://arxiv.org/html/2607.01487#S3.E3)) as a prior to choose such a functional form. Assume some fixed $N$ and $D$ and a fixed sequence length $s$. Then, ([3](https://arxiv.org/html/2607.01487#S3.E3)) simplifies to

$$\mathcal{L} = E + \frac{A}{N^{\alpha}} + (B s^{-\beta})\, b^{-\beta} + (C D^{-\gamma} s^{\gamma})\, b^{\gamma}.$$

Based on the parameters of the fitted 3TL, we make the simplifying assumption $\gamma \approx \beta$ (we will also see that this is sufficiently expressive to give an almost perfect fit). Based on the above, we then fit the form

$$\mathcal{L}(b) = \tilde{E} + \tilde{A}\, b^{-\tilde{\alpha}} + \tilde{B}\, b^{\tilde{\alpha}}. \tag{8}$$

As a first stage, we fit $(\tilde{E}, \tilde{A}, \tilde{B}, \tilde{\alpha})$ from ([8](https://arxiv.org/html/2607.01487#S4.E8)) separately for each $(N,D)$. (Footnote 6: Here we perform a simple non-linear least squares fit using `scipy.optimize.curve_fit`.) This two-stage fitting procedure, where we reduced the number of parameters by assuming $\gamma \approx \beta$, has also been recommended in the survey of Li et al. ([2025b](https://arxiv.org/html/2607.01487#bib.bib15)).
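As a rough illustration of this first stage, the following sketch fits the form (8) to a single batch-size sweep with `scipy.optimize.curve_fit`. The helper names, the initial guess `p0`, the non-negativity bounds, and the synthetic coefficients are assumptions made for this example; they are not settings reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit


def loss_vs_batch(b, E, A, B, alpha):
    """Form (8): L(b) = E + A * b**(-alpha) + B * b**alpha."""
    return E + A * b ** (-alpha) + B * b ** alpha


def fit_single_sweep(batch_sizes, losses, p0=(2.0, 1.0, 1e-3, 0.5)):
    """Fit (E, A, B, alpha) on one batch-size sweep at fixed (N, D).

    p0 and the non-negativity bounds are illustrative assumptions.
    """
    b = np.asarray(batch_sizes, dtype=float)
    y = np.asarray(losses, dtype=float)
    # With bounds given, curve_fit delegates to a bounded least-squares solver.
    popt, _ = curve_fit(
        loss_vs_batch, b, y, p0=p0, bounds=(0.0, np.inf), max_nfev=20000
    )
    return dict(zip(("E", "A", "B", "alpha"), popt))


# Synthetic, noise-free sweep with made-up coefficients (illustration only).
b_grid = 2.0 ** np.arange(4, 13)
y_grid = loss_vs_batch(b_grid, 2.0, 1.0, 2e-3, 0.5)
params = fit_single_sweep(b_grid, y_grid)
```

On real sweeps the result can depend noticeably on the initial guess, so trying several values of `p0` and keeping the best fit is a reasonable precaution.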

Now, let us define the notion of $\varepsilon$-suboptimal batch size:

###### Definition 1.

Let $\varepsilon > 0$, and let $b^{\star}$ be the minimizer of ([8](https://arxiv.org/html/2607.01487#S4.E8)) (for a fixed $D$ and $N$). Since ([8](https://arxiv.org/html/2607.01487#S4.E8)) is unimodal in $b$, we can define $[b_{\min}, b_{\max}]$ to be the interval of $\varepsilon$-suboptimal batch sizes such that $\mathcal{L}(b_{\min}) = \mathcal{L}(b_{\max}) = \mathcal{L}(b^{\star}) + \varepsilon$.

Here, we set $\varepsilon$ to the loss difference from the law ([EpochAI](https://arxiv.org/html/2607.01487#S3.Ex3)) evaluated at $(N, D)$ and $(N, 0.95 \cdot D)$, that is, we allow a $5\%$ suboptimality in terms of compute. From ([8](https://arxiv.org/html/2607.01487#S4.E8)), we can then easily read off, for each $(N,D)$, the interval of $\varepsilon$-suboptimal batch sizes.
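For the form (8), the interval of Definition 1 has a closed form: substituting $x = b^{\tilde{\alpha}}$ turns $\mathcal{L}(b) = \mathcal{L}(b^{\star}) + \varepsilon$ into a quadratic in $x$. The sketch below implements this. The function names are hypothetical, and the helper for $\varepsilon$ assumes that the reference law has the Chinchilla form $E + A/N^{\alpha} + B/D^{\beta}$, so that only its data term contributes to the difference.

```python
import math


def epsilon_from_token_waste(B_ref, beta_ref, D, waste=0.05):
    """Loss gap between (1 - waste) * D and D tokens under a reference law
    of the assumed form L = E + A / N**alpha + B_ref / D**beta_ref.
    E and the model-size term cancel in the difference."""
    return B_ref * (((1.0 - waste) * D) ** (-beta_ref) - D ** (-beta_ref))


def suboptimal_interval(A, B, alpha, eps):
    """Return (b_min, b_star, b_max) for L(b) = E + A*b**(-alpha) + B*b**alpha.

    With x = b**alpha, L - E = A/x + B*x has minimum 2*sqrt(A*B) at
    x = sqrt(A/B). Solving A/x + B*x = 2*sqrt(A*B) + eps gives the
    quadratic B*x**2 - target*x + A = 0. The offset E does not matter.
    """
    if min(A, B, alpha, eps) <= 0:
        raise ValueError("A, B, alpha and eps must be positive")
    target = 2.0 * math.sqrt(A * B) + eps
    disc = math.sqrt(target * target - 4.0 * A * B)
    x_lo = (target - disc) / (2.0 * B)
    x_hi = (target + disc) / (2.0 * B)
    b_star = (A / B) ** (1.0 / (2.0 * alpha))
    return x_lo ** (1.0 / alpha), b_star, x_hi ** (1.0 / alpha)
```

Since the two roots satisfy $x_{\mathrm{lo}} \cdot x_{\mathrm{hi}} = \tilde{A}/\tilde{B}$, the interval obtained from (8) is symmetric around $b^{\star}$ on a logarithmic scale, that is, $b_{\min} b_{\max} = (b^{\star})^{2}$.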

As a second stage, we fit a power law $b_{\min/\max} = \Upsilon / D^{\nu}$, where $(\Upsilon, \nu)$ are fitted. Here, for each model size $N$ we keep the largest three token budgets as a held-out validation set, and only use values of $D$ where the empirically optimal batch size does not lie on the boundary of the sweep.
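One simple way to carry out this second stage is ordinary least squares in log-log space, since $\log b = \log \Upsilon - \nu \log D$ is linear in $\log D$. The text does not state which fitting method was used, so the sketch below is an assumption, and `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(D_values, b_values):
    """Fit b = Upsilon / D**nu by least squares on (log D, log b).

    Returns (Upsilon, nu). A negative nu means that b grows with D.
    """
    xs = [math.log(d) for d in D_values]
    ys = [math.log(b) for b in b_values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # log b = log(Upsilon) - nu * log D, hence nu = -slope.
    return math.exp(intercept), -slope
```

The function would be called once for the $b_{\min}$ values and once for the $b_{\max}$ values, using only the token budgets kept for fitting.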

From [Fig. 5](https://arxiv.org/html/2607.01487#S4.F5), we observe that this two-stage procedure – as expected – leads to a better fit across token budgets $D$. In particular, the fitted power law on $b_{\min/\max}$ generalizes well beyond the token budgets used for fitting ([Fig. 5](https://arxiv.org/html/2607.01487#S4.F5), left). See [Sections A.6.2](https://arxiv.org/html/2607.01487#A1.SS6.SSS2) and [A.6.3](https://arxiv.org/html/2607.01487#A1.SS6.SSS3) for additional model sizes, and for the OpenEuroLLM dataset. When averaging the intervals of suboptimal batch sizes $[b_{\min}, b_{\max}]$ across model sizes ([Fig. 6](https://arxiv.org/html/2607.01487#S4.F6)), we observe a slightly narrowing trend for the Li dataset, and a relatively constant width for OpenEuroLLM; except for this, the picture is overall similar across the two datasets, suggesting that the scaling behavior of suboptimal batch sizes with $D$ is relatively consistent.

> Takeaway: The interval of $\varepsilon$-suboptimal batch sizes can be modeled with a two-stage fitting procedure based on the three-term law; the scaling behavior generalizes well and is fairly consistent across model sizes and training setups. As a rule of thumb, the interval of suboptimal batch sizes that corresponds to wasting at most 5% of compute has a width of roughly $2^{2}$ ([Fig. 6](https://arxiv.org/html/2607.01487#S4.F6)).

![Refer to caption](https://arxiv.org/html/2607.01487v1/x9.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x10.png)

Figure 5: $N=268$M. (Left) Batch size range $[b_{\min}, b_{\max}]$ with $\varepsilon$-suboptimal loss derived from law ([8](https://arxiv.org/html/2607.01487#S4.E8)) (with $\varepsilon$ such that less than $5\%$ compute is wasted). Shaded area is obtained from fitting a power law on the values of $b_{\min/\max}$ in-sample (solid lines). (Right) Empirical and predicted loss value across batch size $b$. Here, the predicted values are from the law ([8](https://arxiv.org/html/2607.01487#S4.E8)), fitted separately for each $D$. Black dotted lines mark $b_{\min/\max}$ used for fitting the power law $b_{\min/\max} \propto D^{\nu}$ on the left. Plots for other model sizes in [Section A.6.2](https://arxiv.org/html/2607.01487#A1.SS6.SSS2).

![Refer to caption](https://arxiv.org/html/2607.01487v1/x11.png) (a) Li (sequence length $2048$)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x12.png) (b) OpenEuroLLM (sequence length $4096$)

Figure 6: Scaling of $\varepsilon$-suboptimal batch size across model sizes, for the Li dataset (left) and the OpenEuroLLM dataset (right). The scaling of suboptimal batch sizes $[b_{\min}, b_{\max}]$ (grey area) is relatively consistent across the two datasets, after accounting for a factor of two due to the different sequence length.

### 4.4 Three-term Law and Critical Batch Size

The notion of *critical batch size* can be defined as follows: for a fixed target loss $\bar{\mathcal{L}}$, let $K_{\bar{\mathcal{L}}}(b)$ be the number of steps to reach loss $\bar{\mathcal{L}}$, as a function of the batch size $b$. As explained in the introduction, McCandlish et al. ([2018](https://arxiv.org/html/2607.01487#bib.bib17)) show that $K_{\bar{\mathcal{L}}}(b)$ decreases at a much slower rate than inverse-linearly beyond a critical value of $b$, the so-called *critical batch size*. This has an important practical consequence: training at the highest practically feasible batch size can be suboptimal if it exceeds the critical batch size.

Empirically, Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26)) show that (i) critical batch size scales with compute under Chinchilla-optimal scaling of $(N,D)$, and (ii) this increase comes mostly from scaling up the token budget $D$. In particular, when $D$ is fixed, the function $K_{\bar{\mathcal{L}}}(b)$ is roughly the same across model sizes $N$. (Footnote 7: Note that Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26)) operate in a slightly non-standard setup, as they use a constant learning-rate schedule with weight averaging, versus the cosine schedule used in Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)).) Furthermore, Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26)) derive a theoretical model of critical batch size that captures the phenomena described above, however only for the very restricted setting of least-squares problems in the infinite-width limit.

Here, we show that the three-term law ([3](https://arxiv.org/html/2607.01487#S3.E3)) can equally well describe the behaviour of $K_{\bar{\mathcal{L}}}(b)$ when scaling $D$ and/or $N$. For this, fix a target loss $\bar{\mathcal{L}}$ and denote $\tilde{E}(N) := E + \frac{A}{N^{\alpha}}$. From ([3](https://arxiv.org/html/2607.01487#S3.E3)), if $\bar{\mathcal{L}} > \tilde{E}(N) + \frac{B}{M^{\beta}}$, the number of steps to reach $\bar{\mathcal{L}}$ is given by

$$K_{\bar{\mathcal{L}}}(b) = \Big[\frac{\bar{\mathcal{L}} - \tilde{E}(N)}{C} - \frac{B}{C M^{\beta}}\Big]^{-\frac{1}{\gamma}} = \Big[\frac{\bar{\mathcal{L}} - \tilde{E}(N)}{C} - \frac{B}{C s^{\beta} b^{\beta}}\Big]^{-\frac{1}{\gamma}}. \tag{9}$$

We replicate the setting of Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26), Figure 1), but plug the configurations into 3TL:

1. (A1) Scale up both $N$ and $D$ in the Chinchilla-optimal setting.
2. (A2) Fix $D = 3.07$B and vary $N$ from $85$M to $1.2$B.
3. (A3) Fix $N = 302$M and vary $D$ in $[0.25, 4]$ times the Chinchilla-optimal $D$.

For all three settings, given $(N,D)$ we compute the target loss $\bar{\mathcal{L}}$ from the law ([EpochAI](https://arxiv.org/html/2607.01487#S3.Ex3)). Then, we compute $K_{\bar{\mathcal{L}}}(b)$ according to ([9](https://arxiv.org/html/2607.01487#S4.E9)), using the parameters of 3TL previously fitted on the Li dataset. [Fig. 7](https://arxiv.org/html/2607.01487#S4.F7) confirms that the function $K_{\bar{\mathcal{L}}}(b)$ is almost invariant as we scale up model size $N$, but changes significantly if we scale up the token budget $D$.
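The computation of the number of steps to reach a target loss can be sketched as follows. The sketch assumes that the three-term law has the form $\mathcal{L} = E + A/N^{\alpha} + B/M^{\beta} + C/K^{\gamma}$ with $M = bs$, which is the form implied by the simplification used in Section 4.3; the function names and signatures are hypothetical.

```python
import math


def three_term_loss(N, b, K, s, E, A, alpha, B, beta, C, gamma):
    """Assumed three-term law: E + A/N**alpha + B/M**beta + C/K**gamma,
    with M = b * s the batch size in tokens."""
    M = b * s
    return E + A / N ** alpha + B / M ** beta + C / K ** gamma


def steps_to_target(b, target_loss, N, s, E, A, alpha, B, beta, C, gamma):
    """Solve the assumed law for K at a fixed target loss.

    Returns math.inf when the target is unreachable at batch size b,
    i.e. when target_loss <= E + A/N**alpha + B/M**beta.
    """
    M = b * s
    gap = target_loss - (E + A / N ** alpha) - B / M ** beta
    if gap <= 0.0:
        return math.inf
    return (gap / C) ** (-1.0 / gamma)
```

Evaluating `steps_to_target` over a grid of batch sizes for each configuration in (A1) to (A3) would produce curves of the kind compared in this section, given fitted parameter values.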

> Takeaway: Under the three-term law, the number of steps $K_{\bar{\mathcal{L}}}(b)$ to reach a target loss $\bar{\mathcal{L}}$, as a function of the batch size, is mostly invariant to scaling up $N$, but not to scaling up $D$. This matches the empirical results of Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26)). As a consequence, the three-term law is a suitable model for $K_{\bar{\mathcal{L}}}(b)$ at large batch sizes, while *at the same time* allowing for a non-trivial optimal batch size. This is in contrast to the theoretical models by McCandlish et al. ([2018](https://arxiv.org/html/2607.01487#bib.bib17)); Bergsma et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib2)) and Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26)), which can describe the critical batch size, but imply that the optimal batch size is one.

##### Comparison to other related models\.

We would like to mention a different approach by von Rütte et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib19), Section 4.5), which modifies the model of McCandlish et al. ([2018](https://arxiv.org/html/2607.01487#bib.bib17)) such that it also allows for optimal batch sizes larger than one. They propose the equation

$$\Big(\Big[\frac{K}{K_{\min}}\Big]^{\alpha} - 1\Big)\Big(\Big[\frac{b}{b_{\min}}\Big]^{\alpha} - 1\Big) = 1. \tag{10}$$

In comparison to the three-term law, the main conceptual difference is the setup of objective and constraints: von Rütte et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib19)) fix a target loss $\bar{\mathcal{L}}$, and minimize $D = bsK$ such that $\bar{\mathcal{L}}$ is reached, subject to ([10](https://arxiv.org/html/2607.01487#S4.E10)); for 3TL, we fix $D$, and minimize the final loss with respect to $b$ subject to $D = bsK$. Given that for a concrete training run it is much easier to fix $D$ than a target loss, the latter seems to be the more practicable approach.
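For comparison with (9), equation (10) can be rearranged into an explicit expression for the number of steps at a given batch size, $K = K_{\min}\big(1 + 1/([b/b_{\min}]^{\alpha} - 1)\big)^{1/\alpha}$, valid for $b > b_{\min}$. The sketch below is only this algebraic rearrangement; the function name is hypothetical, and `alpha` here is the exponent of (10), not the model-size exponent of the three-term law.

```python
import math


def steps_from_batch_size(b, K_min, b_min, alpha):
    """Solve ((K/K_min)**alpha - 1) * ((b/b_min)**alpha - 1) = 1 for K.

    Returns math.inf for b <= b_min, where the relation has no finite solution.
    """
    if b <= b_min:
        return math.inf
    ratio = (b / b_min) ** alpha - 1.0
    return K_min * (1.0 + 1.0 / ratio) ** (1.0 / alpha)
```

In this relation $K$ approaches $K_{\min}$ as $b$ grows and diverges as $b$ approaches $b_{\min}$ from above, so both constants act as asymptotes.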

![Refer to caption](https://arxiv.org/html/2607.01487v1/x13.png) (a) Chinchilla-optimal
![Refer to caption](https://arxiv.org/html/2607.01487v1/x14.png) (b) Fixed data $D = 3.07$B
![Refer to caption](https://arxiv.org/html/2607.01487v1/x15.png) (c) Fixed model $N = 302$M

Figure 7: Under the three-term law, critical batch size changes with token budget $D$ (right), but is almost invariant to changes of model size $N$ (middle). This matches empirical findings, cf. Zhang et al. ([2025](https://arxiv.org/html/2607.01487#bib.bib26), Figure 1).

### 4.5 Back to the Chinchilla Form

Having fitted the 3TL, we can compare its batch-size-optimal reduction from ([5](https://arxiv.org/html/2607.01487#S3.E5)) to a Chinchilla-type law. To do so, we first fit the form ([1](https://arxiv.org/html/2607.01487#S1.E1)) to the runs from Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)) (only using optimal batch sizes); we use the same fitting procedure except that we set $\delta = 10^{-5}$, which leads to a more stable fit. See [Section A.1](https://arxiv.org/html/2607.01487#A1.SS1) in the Appendix for more details and ablations of this choice.

Comparing this to the fitted parameters of the 3TL (see [Table 3](https://arxiv.org/html/2607.01487#S4.T3)), we can already see a rather big discrepancy; for example, with 3TL we obtain a much smaller value of $\tau$ (compared to $\beta$) as well as $E \approx 0$.

Instead of simply comparing the parameter values, we can also compare the implied scaling behavior of both laws. In particular, the main goal of Chinchilla scaling laws is to determine how the optimal model size scales with compute $\mathcal{C}$. Assuming $\mathcal{C} = 6ND$, from ([1](https://arxiv.org/html/2607.01487#S1.E1)) we get $N^{\star} = \Big[\frac{\alpha A}{\beta B}\Big]^{\frac{1}{\alpha+\beta}} \big(\frac{\mathcal{C}}{6}\big)^{\frac{\beta}{\alpha+\beta}}$. For the three-term law, using ([5](https://arxiv.org/html/2607.01487#S3.E5)), we replace $(\beta, B) \to (\tau, \hat{B})$. [Fig. 8](https://arxiv.org/html/2607.01487#S4.F8) shows that the implied compute-optimal scaling of $N^{\star}$ overlaps only for a relatively small interval of compute $\mathcal{C}$.
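The expression for $N^{\star}$ follows from minimizing $A/N^{\alpha} + B/D^{\beta}$ under the constraint $D = \mathcal{C}/(6N)$. A minimal sketch, assuming that the form (1) is $E + A/N^{\alpha} + B/D^{\beta}$ and using a hypothetical function name, evaluates it for a given compute budget:

```python
def compute_optimal_allocation(compute, A, alpha, B, beta):
    """Return (N_star, D_star) minimizing A/N**alpha + B/D**beta
    subject to compute = 6 * N * D.

    For the reduced three-term law, pass (tau, B_hat) in place of (beta, B).
    """
    prefactor = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_star = prefactor * (compute / 6.0) ** (beta / (alpha + beta))
    d_star = compute / (6.0 * n_star)
    return n_star, d_star
```

Evaluating the function over a range of compute budgets with both parameter sets gives the two curves whose overlap is discussed above.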

> Caveat: The three-term law *can* be reduced to a Chinchilla-type law; however, its implied compute-optimal scaling is quite different from a direct fit of ([1](https://arxiv.org/html/2607.01487#S1.E1)). In particular, the implied scaling in $D$ is much smaller. This suggests that 3TL is not the most reliable instrument to describe compute-optimal allocation of $N$ and $D$. This confirms the finding of Li et al. ([2025b](https://arxiv.org/html/2607.01487#bib.bib15)) that the exact formulation of the scaling law can already impact the implied optimal model size.

Table 3: Fitted parameter values when fitting a law of form ([1](https://arxiv.org/html/2607.01487#S1.E1)) to the Li dataset, and for 3TL after the reduction ([5](https://arxiv.org/html/2607.01487#S3.E5)).

![Refer to caption](https://arxiv.org/html/2607.01487v1/x16.png)

Figure 8: Compute-optimal model size.

## 5 Limitations

We have already addressed some caveats in the discussions above\. Here, we summarize the main limitations of the presented approach and how they could be resolved in future work:

- While we have shown that the three-term law can be robustly fit for two different datasets (Li and OpenEuroLLM), the quantitative results can be inconsistent to a degree which is minor (e.g. for optimal batch size scaling) or moderately high (e.g. impact of model size). It is not clear how well the reported scaling laws generalize to other training setups or tasks. Further, although more sophisticated scaling law formulations can in principle collapse back to the Chinchilla form, the resulting scaling can be quite different ([Section 4.5](https://arxiv.org/html/2607.01487#S4.SS5)).
- While the three-term law explicitly models the batch size, we still need the optimal learning rate for each single combination of $(N,D,b)$; thus, despite our finding that the required number of training runs can be reduced, the absolute number is still huge (before selecting the optimal learning rate, the Li dataset contains roughly 3000 runs). An interesting direction for future work would be to introduce the learning rate in the three-term law, possibly inspired again by the theoretical results from Shulgin et al. ([2026](https://arxiv.org/html/2607.01487#bib.bib23)) or similar works. However, given the previous limitation, it is unlikely that adding additional terms will alleviate the issue of consistency.
- As we have seen, the three-term law alone is not predictive enough to infer the interval of $\varepsilon$-suboptimal batch sizes. For the two-stage procedure we propose instead, we still require a relatively fine-grained batch size sweep.
- Optimal batch size scaling might be optimizer-dependent; in particular, it has been shown that the Muon optimizer (Jordan et al., [2024](https://arxiv.org/html/2607.01487#bib.bib32)) allows for larger batch sizes (Essential AI et al., [2025](https://arxiv.org/html/2607.01487#bib.bib31)). Investigating how switching the optimizer affects the fitted three-term law remains future work.

## 6 Conclusion

We have proposed a three\-term scaling law that takes into account model size, training steps and batch size; the latter two explicitly model how the total amount of tokens is allocated\. This formulation has natural advantages, bringing together Chinchilla\-type and hyperparameter scaling laws, as well as tying it closely to theoretical results in stochastic optimization\.

On the practical side, we have first shown that our proposed law can be robustly fit even with incomplete batch size sweeps, thus substantially reducing the number of training runs necessary to obtain scaling laws for the optimal batch size.

Second, our approach naturally allows modeling suboptimal batch sizes, and we have derived their scaling with the total data budget. Finally, we have shown that the three-term law, in contrast to previous proposals, correctly describes the phenomenon of critical batch size, while at the same time allowing for a non-trivial optimal batch size.

## Acknowledgments

Fabian Schaipp is supported by the French government under the management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR\-19\-P3IA\-0001 \(PRAIRIE 3IA Institute\), and the European Research Council Starting Grant DYNASTY – 101039676\.

First, we would like to thank the authors of Li et al. ([2025a](https://arxiv.org/html/2607.01487#bib.bib14)) for making all of their training runs public; without their dataset, this article would not have been realized.

Second, many thanks go to Niccolò Ajroldi for compiling and providing access to the OpenEuroLLM dataset, and to both Niccolò Ajroldi and Antonio Orvieto for their feedback and suggestions, which inspired some of the ideas of the paper.

Furthermore, this paper has benefited from discussions with Francis Bach, Alexander Hägele, Frederik Kunstner, Umut Şimşekli, and Adrien Taylor\.

## References

- Bergsma et al. (2025) Power lines: scaling laws for weight decay and batch size in LLM pre-training. In Advances in Neural Information Processing Systems, Vol. 38, pp. 125153–125188.
- T. Besiroglu, E. Erdil, M. Barnett, and J. You (2024) Chinchilla scaling: a replication attempt. arXiv:2404.10102.
- C. Bodnar, W. P. Bruinsma, A. Lucic, M. Stanley, A. Vaughan, J. Brandstetter, P. Garvan, M. Riechert, J. A. Weyn, H. Dong, J. K. Gupta, K. Thambiratnam, A. T. Archibald, C. Wu, E. Heider, M. Welling, R. E. Turner, and P. Perdikaris (2024) A foundation model for the earth system. arXiv:2405.13063.
- DeepSeek-AI, X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, H. Gao, K. Gao, W. Gao, R. Ge, K. Guan, D. Guo, J. Guo, G. Hao, Z. Hao, Y. He, W. Hu, P. Huang, E. Li, G. Li, J. Li, Y. Li, Y. K. Li, W. Liang, F. Lin, A. X. Liu, B. Liu, W. Liu, X. Liu, X. Liu, Y. Liu, H. Lu, S. Lu, F. Luo, S. Ma, X. Nie, T. Pei, Y. Piao, J. Qiu, H. Qu, T. Ren, Z. Ren, C. Ruan, Z. Sha, Z. Shao, J. Song, X. Su, J. Sun, Y. Sun, M. Tang, B. Wang, P. Wang, S. Wang, Y. Wang, Y. Wang, T. Wu, Y. Wu, X. Xie, Z. Xie, Z. Xie, Y. Xiong, H. Xu, R. X. Xu, Y. Xu, D. Yang, Y. You, S. Yu, X. Yu, B. Zhang, H. Zhang, L. Zhang, L. Zhang, M. Zhang, M. Zhang, W. Zhang, Y. Zhang, C. Zhao, Y. Zhao, S. Zhou, S. Zhou, Q. Zhu, and Y. Zou (2024) DeepSeek LLM: scaling open-source language models with longtermism. arXiv:2401.02954.
- Essential AI, I. Shah, A. M. Polloreno, K. Stratos, P. Monk, A. Chaluvaraju, A. Hojel, A. Ma, A. Thomas, A. Tanwer, D. J. Shah, K. Nguyen, K. Smith, M. Callahan, M. Pust, M. Parmar, P. Rushton, P. Mazarakis, R. Kapila, S. Srivastava, S. Singla, T. Romanski, Y. Vanjani, and A. Vaswani (2025) Practical efficiency of Muon for pretraining. arXiv:2505.02222.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. Rae, and L. Sifre (2022) An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems, Vol. 35, pp. 30016–30030.
- R. Islamov, R. Machacek, A. Lucchi, A. Silveti-Falls, E. Gorbunov, and V. Cevher (2026) On the role of batch size in stochastic conditional gradient methods. arXiv:2603.21191.
- K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein (2024) Muon: an optimizer for hidden layers in neural networks. Blog post. [Link](https://kellerjordan.github.io/posts/muon/)
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. arXiv:2001.08361.
- D. Kovalev (2025) Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization. arXiv:2503.12645.
- H. Li, W. Zheng, J. Hu, Q. Wang, H. Zhang, Z. Wang, S. Xuyang, Y. Fan, S. Zhou, X. Zhang, and D. Jiang (2025a) Predictable scale: part i – optimal hyperparameter scaling law in large language model pretraining. arXiv:2503.04715.
- M. Li, S. Kudugunta, and L. Zettlemoyer (2025b) (Mis)fitting scaling laws: A survey of scaling law fitting techniques in deep learning. In International Conference on Learning Representations.
- Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, A. dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, and A. Rives (2023) Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379(6637), pp. 1123–1130. https://www.science.org/doi/pdf/10.1126/science.ade2574
- I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations.
- S. McCandlish, J. Kaplan, D. Amodei, and O. D. Team (2018) An empirical model of large-batch training. arXiv:1812.06162.
- OpenEuroLLM Consortium. A dataset of LLM training runs.
- T. Porian, M. Wortsman, J. Jitsev, L. Schmidt, and Y. Carmon (2024) Resolving discrepancies in compute-optimal scaling of language models. In Advances in Neural Information Processing Systems, Vol. 37, pp. 100535–100570.
- F. Schaipp, A. Hägele, A. Taylor, U. Simsekli, and F. Bach (2025) The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. In International Conference on Machine Learning, Vol. 267, pp. 53267–53294.
- C. J. Shallue, J. Lee, J. M. Antognini, J. N. Sohl-Dickstein, R. Frostig, and G. E. Dahl (2018) Measuring the effects of data parallelism on neural network training. J. Mach. Learn. Res. 20, pp. 112:1–112:49.
- E. Shulgin, D. von Rütte, T. H. Zhang, N. Ajroldi, B. Schölkopf, and A. Orvieto (2026) Deriving hyperparameter scaling laws via modern optimization theory. arXiv:2603.15958.
- D. von Rütte, J. Fluri, O. Pooladzandi, B. Schölkopf, T. Hofmann, and A. Orvieto (2026) Scaling behavior of discrete diffusion language models. In International Conference on Learning Representations.
- X. Wang and L. Aitchison (2025) How to set AdamW’s weight decay as you scale model and dataset size. In International Conference on Machine Learning, Vol. 267, pp. 62222–62250.
- X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer (2022) Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12104–12113.
- H. Zhang, D. Morwani, N. Vyas, J. Wu, D. Zou, U. Ghai, D. P. Foster, and S. M. Kakade (2025) How does critical batch size scale in pre-training? In International Conference on Learning Representations.

###### Contents

1. [1 Introduction](https://arxiv.org/html/2607.01487#S1)
2. [2 Overview](https://arxiv.org/html/2607.01487#S2)
3. [3 Scaling Laws with Training Steps and Batch Size](https://arxiv.org/html/2607.01487#S3)
4. [4 Experiments](https://arxiv.org/html/2607.01487#S4)
   1. [4.1 Comparison of Approach I and II](https://arxiv.org/html/2607.01487#S4.SS1)
   2. [4.2 Compute Savings Using the Three-term Law](https://arxiv.org/html/2607.01487#S4.SS2)
   3. [4.3 Performance with Suboptimal Batch Sizes](https://arxiv.org/html/2607.01487#S4.SS3)
   4. [4.4 Three-term Law and Critical Batch Size](https://arxiv.org/html/2607.01487#S4.SS4)
   5. [4.5 Back to the Chinchilla Form](https://arxiv.org/html/2607.01487#S4.SS5)
5. [5 Limitations](https://arxiv.org/html/2607.01487#S5)
6. [6 Conclusion](https://arxiv.org/html/2607.01487#S6)
7. [References](https://arxiv.org/html/2607.01487#bib)
8. [A Experiments: Supplementary Material](https://arxiv.org/html/2607.01487#A1)
   1. [A.1 Details on Experimental Setup](https://arxiv.org/html/2607.01487#A1.SS1)
   2. [A.2 Additional Observations](https://arxiv.org/html/2607.01487#A1.SS2)
   3. [A.3 Scaling Law Coefficients](https://arxiv.org/html/2607.01487#A1.SS3)
   4. [A.4 Fitting 2TL and 3TL on OpenEuroLLM Dataset](https://arxiv.org/html/2607.01487#A1.SS4)
   5. [A.5 Ablation on Value of $\delta$](https://arxiv.org/html/2607.01487#A1.SS5)
   6. [A.6 Additional Plots](https://arxiv.org/html/2607.01487#A1.SS6)
      1. [A.6.1 Li Dataset Overview](https://arxiv.org/html/2607.01487#A1.SS6.SSS1)
      2. [A.6.2 Suboptimal Batch Size Scaling: Li](https://arxiv.org/html/2607.01487#A1.SS6.SSS2)
      3. [A.6.3 Suboptimal Batch Size Scaling: OpenEuroLLM](https://arxiv.org/html/2607.01487#A1.SS6.SSS3)
      4. [A.6.4 Reduced Dataset Fit for OpenEuroLLM](https://arxiv.org/html/2607.01487#A1.SS6.SSS4)

## Appendix A Experiments: Supplementary Material

### A.1 Details on Experimental Setup

##### Datasets\.

Below is a short description and source for each of the two main datasets we use in the analysis. Within each dataset, the sequence length is the same across runs ($2048$ for Li, and $4096$ for OpenEuroLLM). For the scaling laws, we select for each combination of $(N, b, D)$ (model size, batch size, token budget) the learning rate that obtains the smallest final loss.

Li (Li et al., [2025a](https://arxiv.org/html/2607.01487#bib.bib14)): Training logs for different model sizes, token budgets, batch sizes, and learning rates. We use their smoothed loss and filter on dense models (no MoEs).

OpenEuroLLM ([OpenEuroLLM Consortium](https://arxiv.org/html/2607.01487#bib.bib30)): This is an unpublished dataset of training runs executed by the [OpenEuroLLM initiative](https://arxiv.org/html/2607.01487v1/www.openeurollm.eu). Additional details on the training setup will be provided upon its release by the OpenEuroLLM initiative.

Table 4: Overview of the datasets used for fitting our scaling laws. Here, we report the ranges of $(N, M, K)$ on the union of train and validation set. See also [Fig. 13](https://arxiv.org/html/2607.01487#A1.F13).

##### Fitting methodology\.

We first describe the procedure for fitting the scaling laws. The training set is split into five parts of equal size. Each law is then fit five times, each time leaving out one of the five parts (five-fold cross-validation). This makes the fit more robust to outliers and lets us quantify the uncertainty of each fitted parameter.

For the fitting procedure, we use the same Huber loss function as Besiroglu et al. ([2024](https://arxiv.org/html/2607.01487#bib.bib3)): for true loss values $\mathcal{L}_{\text{true}}$ and predicted loss values $\hat{\mathcal{L}}$, we minimize

$$\sum_{i} \mathcal{H}_{\delta}\Big(\log \mathcal{L}^{(i)}_{\text{true}} - \log \hat{\mathcal{L}}^{(i)}\Big), \quad \mathcal{H}_{\delta}(z) = \begin{cases} \frac{1}{2} z^{2}, \quad & |z| \leq \delta, \\ \delta\big(|z| - \frac{1}{2}\delta\big), \quad & \text{else}. \end{cases}$$

We use the `minimize` method from `scipy.optimize` together with the `L-BFGS-B` optimizer.
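
The objective above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' fitting code: it uses the standard Huber function (quadratic for $|z| \leq \delta$), and `toy_law` is a hypothetical one-term stand-in, since the 2TL and 3TL forms are defined in the main text. The log-parameterization of `E` and `A` and the synthetic data are assumptions made only to keep the example self-contained.

```python
import numpy as np
from scipy.optimize import minimize


def huber(z, delta):
    """Standard Huber function: quadratic for |z| <= delta, linear beyond."""
    abs_z = np.abs(z)
    return np.where(abs_z <= delta, 0.5 * z**2, delta * (abs_z - 0.5 * delta))


def toy_law(params, x):
    """Hypothetical stand-in law L(x) = E + A / x**alpha (not the paper's 2TL/3TL)."""
    log_E, log_A, alpha = params  # logs keep E and A positive (an assumption)
    return np.exp(log_E) + np.exp(log_A) * x ** (-alpha)


def objective(params, x, loss_true, delta=1e-3):
    """Sum of Huber penalties on residuals between log true and log predicted loss."""
    residual = np.log(loss_true) - np.log(toy_law(params, x))
    return huber(residual, delta).sum()


# Synthetic, noise-free data generated from the toy law itself.
x = np.logspace(2, 5, 20)
true_params = np.array([np.log(1.5), np.log(30.0), 0.4])
loss_true = toy_law(true_params, x)

# A single L-BFGS-B run from one starting point.
x0 = np.array([0.0, 2.0, 0.3])
result = minimize(objective, x0=x0, args=(x, loss_true), method="L-BFGS-B")
```

Because the residual is taken between logarithms, $\delta$ acts as a threshold on relative rather than absolute error; with a small $\delta$ such as $10^{-3}$, most residuals fall in the linear branch.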

##### Choice of $\delta$.

For the Huber loss, we set $\delta = 10^{-3}$ for the 2TL and 3TL scaling laws, which is the standard value also used by Besiroglu et al. ([2024](https://arxiv.org/html/2607.01487#bib.bib3)). We run an ablation on this choice, setting $\delta = 10^{-5}$ instead; see [Section A.5](https://arxiv.org/html/2607.01487#A1.SS5) for details.

##### Initialization\.

For each single fit, we minimize the Huber loss from $n_{\text{init}}$ different initializations with L-BFGS, and select the solution with the smallest objective value. By default, we use a grid of ten values for each of $(\alpha, \beta, \gamma, E)$ and two values for each of $(A, B, C)$. That is, for a 2TL (see ([6](https://arxiv.org/html/2607.01487#S3.E6))) we have $n_{\text{init}} = 10^{3} \cdot 2^{2} = 4000$; for a 3TL (see ([3](https://arxiv.org/html/2607.01487#S3.E3))) this becomes computationally intensive, and we therefore randomly select $5000$ starting points from the grid.
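
As a rough sketch of this multi-start setup, the grid can be built with `itertools.product` and subsampled for the 3TL. The grid values below are hypothetical (the text states only how many values each parameter takes), as are the helper name `build_grid` and the random seed. With ten values for each of $(\alpha, \beta, \gamma, E)$ and two for each of $(A, B, C)$, the full 3TL grid has $10^{4} \cdot 2^{3} = 80000$ points, from which 5000 are drawn.

```python
import itertools
import random

# Hypothetical grid values; only the counts per parameter come from the text.
E_grid = [0.2 * k for k in range(10)]            # ten values for E
exponent_grid = [0.1 * k for k in range(1, 11)]  # ten values per exponent
coef_grid = [1.0, 10.0]                          # two values per coefficient


def build_grid(n_exponents, n_coefs):
    """Cartesian product over E, the exponents, and the coefficients."""
    axes = [E_grid] + [exponent_grid] * n_exponents + [coef_grid] * n_coefs
    return list(itertools.product(*axes))


# 2TL has parameters (E, beta, gamma, B, C): 10^3 * 2^2 starting points.
grid_2tl = build_grid(n_exponents=2, n_coefs=2)

# 3TL has parameters (E, alpha, beta, gamma, A, B, C): 10^4 * 2^3 grid points.
grid_3tl = build_grid(n_exponents=3, n_coefs=3)

# Draw 5000 distinct starting points from the 3TL grid (seed is arbitrary).
rng = random.Random(0)
starts_3tl = rng.sample(grid_3tl, 5000)
```

Each starting point would then be passed to the optimizer, keeping the run with the smallest final objective value.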

##### Evaluation\.

After fitting, we predict loss values by averaging the predictions of the five individually fitted models (*cross-validation ensemble*). In [Tables 5](https://arxiv.org/html/2607.01487#A1.T5) and [6](https://arxiv.org/html/2607.01487#A1.T6), we report the in-sample mean absolute deviation (MAD) between the predicted and true loss values.
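
The cross-validation ensemble and the MAD metric can be illustrated as follows. This is a sketch under stated assumptions: `fit_power_law` is a hypothetical stand-in that fits a straight line in log-log space (not the Huber-based fit of the 2TL or 3TL), and the data are synthetic and noise-free.

```python
import numpy as np


def fit_power_law(x, y):
    """Stand-in fit: least squares for log y = c0 + c1 * log x."""
    c1, c0 = np.polyfit(np.log(x), np.log(y), deg=1)
    return c0, c1


def predict(coefs, x):
    c0, c1 = coefs
    return np.exp(c0 + c1 * np.log(x))


def cv_ensemble(x, y, n_folds=5, seed=0):
    """Fit one model per fold, each time leaving that fold out."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    fits = []
    for held_out in folds:
        keep = np.setdiff1d(np.arange(len(x)), held_out)
        fits.append(fit_power_law(x[keep], y[keep]))
    return fits


def ensemble_predict(fits, x):
    """Average the predictions of the individually fitted models."""
    return np.mean([predict(f, x) for f in fits], axis=0)


def mean_abs_dev(y_true, y_pred):
    """In-sample mean absolute deviation between true and predicted losses."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Synthetic data following an exact power law.
x = np.logspace(1, 4, 25)
y = 5.0 * x ** -0.2
fits = cv_ensemble(x, y)
mad = mean_abs_dev(y, ensemble_predict(fits, x))
```

The spread of the per-fold coefficients in `fits` is what a standard deviation per fitted parameter would be computed from.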

### A.2 Additional Observations

This section lists changes to the fitting technique that we tried (usually only on the Li dataset), but which did not lead to better or significantly different results.

1. (I) Because the fitted values of the parameter $E$ in our laws are almost zero, we try to enforce larger values by adding the regularization term $\frac{\lambda}{2}(\log E)^{2}$. With $\lambda = 10^{-3}$, this leads to different coefficients, in particular $E = 0.729$; for the other analyses, for example the optimal batch size scaling, this has no major impact.
2. (II) Providing the gradient of ([3](https://arxiv.org/html/2607.01487#S3.E3)) to the L-BFGS solver has no major impact on the results.
3. (III) Using BFGS instead of L-BFGS has no major impact on the results (while being significantly slower to fit).
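
To illustrate item (I), the regularization term can be written as a small function. The function name `log_E_penalty` is hypothetical; the term itself and the value of $\lambda$ are taken from the list above. The penalty vanishes at $E = 1$ and grows as $E$ approaches zero, which is why adding it to the objective discourages near-zero values of $E$.

```python
import math


def log_E_penalty(E, lam=1e-3):
    """Regularization term (lam / 2) * (log E)^2, minimized at E = 1."""
    return 0.5 * lam * math.log(E) ** 2


# The penalty is symmetric in log-space: E and 1/E are penalized equally.
penalty_small_E = log_E_penalty(1e-11)  # E close to zero
penalty_moderate_E = log_E_penalty(0.729)  # the regularized value reported above
```

In a regularized fit, this value would simply be added to the sum of Huber terms before passing the objective to the optimizer.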

### A.3 Scaling Law Coefficients

Table 5: Li dataset: Scaling law coefficients for 2TL (see ([6](https://arxiv.org/html/2607.01487#S3.E6))) across model sizes, as well as for 3TL (see ([3](https://arxiv.org/html/2607.01487#S3.E3))). 2TL has no parameters $(A, \alpha)$, see ([6](https://arxiv.org/html/2607.01487#S3.E6)). We report the average value across subsampled fits, with the standard deviation in brackets. MAD refers to the in-sample mean absolute deviation of the predicted (with cross-validation ensemble) to the true loss values.

| Model size $N$ | $E$ | $A$ | $B$ | $C$ | $\alpha$ | $\beta$ | $\gamma$ | Samples | MAD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 60M | 0.705 (0.31) | - | 4.56 (0.15) | 7.26 (1.4) | - | 0.0864 (0.018) | 0.278 (0.029) | 13 | 0.00802 |
| 120M | 0.0395 (0.039) | - | 5.24 (0.22) | 8.22 (1.7) | - | 0.0752 (0.0082) | 0.274 (0.044) | 13 | 0.0089 |
| 215M | 0.481 (0.38) | - | 4.66 (0.44) | 5.01 (0.6) | - | 0.0983 (0.033) | 0.204 (0.028) | 40 | 0.0123 |
| 268M | 0.959 (0.17) | - | 4.04 (0.3) | 4.33 (0.23) | - | 0.113 (0.019) | 0.201 (0.011) | 60 | 0.0143 |
| 429M | 0.763 (0.22) | - | 5.21 (0.73) | 3.47 (0.25) | - | 0.151 (0.027) | 0.133 (0.018) | 55 | 0.0115 |
| 537M | 0.198 (0.1) | - | 5.05 (0.55) | 3.67 (0.1) | - | 0.13 (0.02) | 0.108 (0.015) | 40 | 0.00962 |
| 1074M | 1.37e-07 (2.7e-07) | - | 5.63 (0.35) | 4.3 (0.19) | - | 0.126 (0.0087) | 0.125 (0.008) | 25 | 0.0111 |
| Three-term | 1.08e-11 (2.2e-11) | 12.6 (0.59) | 4.9 (0.2) | 4.27 (0.15) | 0.132 (0.0043) | 0.139 (0.008) | 0.182 (0.0074) | 246 | 0.0159 |

Table 6: Same as [Table 5](https://arxiv.org/html/2607.01487#A1.T5), but for OpenEuroLLM dataset.

| Model size $N$ | $E$ | $A$ | $B$ | $C$ | $\alpha$ | $\beta$ | $\gamma$ | Samples | MAD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50M | 2.3 (0.048) | - | 2.26 (0.42) | 2.27 (0.38) | - | 0.161 (0.026) | 0.2 (0.026) | 43 | 0.00654 |
| 130M | 1.79 (0.11) | - | 3.45 (0.76) | 2.92 (0.7) | - | 0.164 (0.028) | 0.184 (0.04) | 39 | 0.00798 |
| 300M | 1.16 (0.13) | - | 3.07 (0.21) | 2.51 (0.056) | - | 0.113 (0.018) | 0.138 (0.007) | 38 | 0.00656 |
| 600M | 1.09 (0.021) | - | 4.41 (0.36) | 2.58 (0.057) | - | 0.159 (0.011) | 0.122 (0.0085) | 36 | 0.00562 |
| 1000M | 1.02 (0.12) | - | 5.17 (2) | 2.8 (0.16) | - | 0.169 (0.042) | 0.128 (0.012) | 35 | 0.00718 |
| Three-term | 0.264 (0.23) | 180 (24) | 2.62 (0.059) | 2.73 (0.16) | 0.292 (0.0077) | 0.0705 (0.017) | 0.156 (0.011) | 191 | 0.0128 |

### A.4 Fitting 2TL and 3TL on OpenEuroLLM Dataset

![Refer to caption](https://arxiv.org/html/2607.01487v1/x17.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x18.png)

Figure 9: Same as [Fig. 1](https://arxiv.org/html/2607.01487#S4.F1), but for OpenEuroLLM dataset.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x19.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x20.png)

Figure 10: Same as [Fig. 2](https://arxiv.org/html/2607.01487#S4.F2), but for OpenEuroLLM dataset. (Left) Training set, (right) validation set.

### A.5 Ablation on Value of $\delta$

When fitting the standard Chinchilla form ([1](https://arxiv.org/html/2607.01487#S1.E1)) to the Li dataset with $\delta = 10^{-3}$, we observe that the variance in the parameter estimates appears to be unbalanced; this is likely due to the rather small range of $N$ in the dataset, which has been reported to cause issues for fitting scaling laws (Li et al., [2025b](https://arxiv.org/html/2607.01487#bib.bib15)). We find that using $\delta = 10^{-5}$ fixes this. Hence, we perform an ablation with $\delta = 10^{-5}$ for the two-term and three-term scaling laws, to check whether the value of $\delta$ also affects the fit of those laws.

Below we show the results of [Section 4.1](https://arxiv.org/html/2607.01487#S4.SS1), but using $\delta = 10^{-5}$ instead of $\delta = 10^{-3}$. In short, for 3TL we observe almost identical results, albeit with slightly higher variance for the scaling of $M^{\star}$. For 2TL, we observe a slightly worse MAD when using $\delta = 10^{-5}$, as well as higher variance for the coefficients $(\gamma, \beta)$ and for the scaling of $M^{\star}$.

Overall, the choice of $\delta$ does not have a big impact on the conclusions of [Section 4.1](https://arxiv.org/html/2607.01487#S4.SS1), with $\delta = 10^{-3}$ being slightly preferable.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x21.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x22.png)

Figure 11: Same as [Fig. 2](https://arxiv.org/html/2607.01487#S4.F2), but with $\delta = 10^{-5}$.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x23.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x24.png)

Figure 12: Same as [Fig. 1](https://arxiv.org/html/2607.01487#S4.F1), but with $\delta = 10^{-5}$.

### A.6 Additional Plots

#### A.6.1 Li Dataset Overview

![Refer to caption](https://arxiv.org/html/2607.01487v1/x25.png)

Figure 13: Overview of the Li dataset used for fitting scaling laws. Dots with dashed border are part of the validation set.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x26.png)

Figure 14: Illustration of the reduced dataset used in [Section 4.2](https://arxiv.org/html/2607.01487#S4.SS2), with 3 batch size values $b$ per sweep. See also caption of [Fig. 13](https://arxiv.org/html/2607.01487#A1.F13); here, the difference is that for each value of $(N, D)$ we use only three different batch sizes (lying on a diagonal) for fitting, while the validation set remains the same as before.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x27.png)

Figure 15: Overview of the full Li dataset (before learning-rate selection). Each heatmap represents the final loss over a grid of batch size $b$ ($y$-axis) and learning rate $\eta$ ($x$-axis) for a single combination of $(N, D)$. Blue squares mark the optimal combination of $(\eta, b)$, gray squares mark the optimal learning rate for the given row of batch size. Note that most marked squares do not lie on the border, therefore indicating that the sweep is sufficiently extensive.

#### A.6.2 Suboptimal Batch Size Scaling: Li

![Refer to caption](https://arxiv.org/html/2607.01487v1/x28.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x29.png)

Figure 16: Same as [Fig. 5](https://arxiv.org/html/2607.01487#S4.F5), but for $N = 429$M.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x30.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x31.png)

Figure 17: Same as [Fig. 5](https://arxiv.org/html/2607.01487#S4.F5), but for $N = 537$M.

#### A.6.3 Suboptimal Batch Size Scaling: OpenEuroLLM

![Refer to caption](https://arxiv.org/html/2607.01487v1/x32.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x33.png)

Figure 18: Same as [Fig. 5](https://arxiv.org/html/2607.01487#S4.F5), but for OpenEuroLLM dataset with $N = 300$M.

![Refer to caption](https://arxiv.org/html/2607.01487v1/x34.png)
![Refer to caption](https://arxiv.org/html/2607.01487v1/x35.png)

Figure 19: Same as [Fig. 5](https://arxiv.org/html/2607.01487#S4.F5), but for OpenEuroLLM dataset with $N = 1$B.

#### A.6.4 Reduced Dataset Fit for OpenEuroLLM

![Refer to caption](https://arxiv.org/html/2607.01487v1/x36.png)
(a) Dataset reduced to 63%

![Refer to caption](https://arxiv.org/html/2607.01487v1/x37.png)
(b) Dataset reduced to 42%

Figure 20: Same as [Fig. 3](https://arxiv.org/html/2607.01487#S4.F3), but for OpenEuroLLM dataset. Fitting on a reduced dataset, with only 3 values of $b$ per sweep (left) and 2 values (right).