Unified Neural Scaling Laws
Summary
This paper presents Unified Neural Scaling Laws (UNSL), a functional form that accurately models and extrapolates deep neural network scaling behaviors as multiple dimensions such as parameters, data, and steps vary simultaneously, improving over previous scaling laws.
Cached at: 05/27/26, 09:06 AM
# Unified Neural Scaling Laws
Source: [https://arxiv.org/html/2605.26248](https://arxiv.org/html/2605.26248)
Ethan Caballero (Mila, University of Montreal; ethan.victor.caballero@gmail.com, ethan.caballero@mila.quebec), Priyank Jaini (Google DeepMind), David Krueger (Mila, University of Montreal), Irina Rish (Mila, University of Montreal)
###### Abstract
We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e., how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.
## 1 Introduction
Training today's state-of-the-art neural networks requires significant amounts of computational resources and training data. Given a wide range of available methods and architectures to choose from, accurate forecasting of their performance is essential for selecting those that are likely to perform best at scale, especially since the top-performing methods at smaller scales often fail to maintain their performance at larger scales (Sutton, [2019](https://arxiv.org/html/2605.26248#bib.bib1); Tolstikhin et al., [2021](https://arxiv.org/html/2605.26248#bib.bib59)). Moreover, accurate forecasting of neural network behaviors at scale is critical not only for identifying the top-performing approaches but also for ensuring AI safety, as predicting the emergence of novel capabilities at scale is essential for responsible development and deployment of advanced AI systems. This realization motivated the study of neural scaling laws (Cortes et al., [1994](https://arxiv.org/html/2605.26248#bib.bib42); Hestness et al., [2017](https://arxiv.org/html/2605.26248#bib.bib43); Rosenfeld et al., [2019](https://arxiv.org/html/2605.26248#bib.bib45); Kaplan et al., [2020](https://arxiv.org/html/2605.26248#bib.bib20); Zhai et al., [2021](https://arxiv.org/html/2605.26248#bib.bib46); Abnar et al., [2021](https://arxiv.org/html/2605.26248#bib.bib47); Brown et al., [2020](https://arxiv.org/html/2605.26248#bib.bib23); Bahri et al., [2021](https://arxiv.org/html/2605.26248#bib.bib94); Alabdulmohsin et al., [2022](https://arxiv.org/html/2605.26248#bib.bib51); Caballero et al., [2023](https://arxiv.org/html/2605.26248#bib.bib2)), which aim to predict the behavior of large-scale models as the amount of compute, data, and model parameters increases.
In principle, the accuracy, as well as the confidence, of predictions made by neural scaling laws can only increase (or remain the same) as a larger number of relevant predictors are included, due to the standard conditional entropy inequality, $H(Y \mid \mathbf{X}) \leq H(Y)$, where $\mathbf{X}$ is the vector of predictive variables and $Y$ is the performance evaluation metric. Namely, as the number of predictive variables $X_i$, $i = 1, \ldots, m$, increases, the conditional entropy $H(Y \mid (X_1, \ldots, X_m))$ can only decrease (or remain the same). Ultimately, to obtain the maximal achievable reduction in the entropy of $Y$, one would need to identify the set of all possible $X_i$ that are causally related to $Y$, and develop a complete model $P(Y \mid \mathbf{X})$ that can serve as a “unified functional form” of neural network behavior(s) at scale.
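The entropy inequality invoked above can be checked numerically on a small example. The sketch below is illustrative only: the joint distribution is a hypothetical toy table (not data from this paper), and the helper names are assumptions introduced here.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Shannon entropy (in bits) of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal_entropy_y(joint):
    """H(Y) for a joint table {(x, y): probability}."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    return entropy(p_y.values())

def conditional_entropy_y_given_x(joint):
    """H(Y | X) = sum over x of p(x) * H(Y | X = x)."""
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p
    total = 0.0
    for x, px in p_x.items():
        conditional = [p / px for (xx, _), p in joint.items() if xx == x]
        total += px * entropy(conditional)
    return total

# Hypothetical toy joint distribution over (X, Y); the numbers are illustrative only.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_y = marginal_entropy_y(joint)
h_y_given_x = conditional_entropy_y_given_x(joint)
print(f"H(Y) = {h_y:.4f} bits, H(Y|X) = {h_y_given_x:.4f} bits")
```

Observing $X$ cannot increase the remaining uncertainty about $Y$ in this idealized sense; whether a fitted functional form realizes that reduction on finite data is a separate, empirical question.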
To address this need for a (more) unified functional form, we present Unified Neural Scaling Laws (UNSL), a functional form that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set. Additionally, this functional form accurately models and extrapolates multivariate scaling behavior that other functional forms are incapable of expressing, such as the nonmonotonic transitions present in the scaling behavior of overfitting and hyperparameters (such as learning rate and standard deviation of weights at initialization) that have a nonmonotonic relationship with the performance evaluation metric.
## 2 The Functional Form of Unified Neural Scaling Laws

Figure 1: An illustration of a Unified Neural Scaling Law (UNSL) (dark solid lines) with two input dimensions, $x_1$ and $x_2$; the central and the right plots show the projections on each of the input dimensions, respectively. In this particular example, the UNSL contains 3 hyperbreaks highlighted by brighter dotted lines: orange, yellow, and green. The green hyperbreak is created by a non-bottleneck component. The orange hyperbreak is created by an $x_1$ bottleneck component. The yellow hyperbreak is created by an $x_2$ bottleneck component. See Section [2](https://arxiv.org/html/2605.26248#S2) for a detailed explanation of hyperbreaks.

Let $y$ denote a performance evaluation metric of interest, e.g., prediction error or cross-entropy, "upstream" (i.e., measured on the validation dataset from the pretraining data distribution) or "downstream" (i.e., measured on new data and/or tasks that the model does not encounter during pretraining), and let $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ denote a tuple of $m$ quantities that can be viewed as predictors of $y$, e.g., number of model parameters, training dataset size, number of training steps, number of inference steps, and values of various hyperparameters.
We present the following general functional form of a unified neural scaling law \(UNSL\):
$$y = a_0 + \left(\left(Q(3) + \underbrace{\left(Q(S+4) + a_1^{-1}\right)^{-1}}_{\text{oppositional force of overfitting}}\right)^{-1} + a_2^{-1}\right)^{-1}, \tag{1}$$

where $Q$ is defined as follows:

$$Q(q) = \left(\left(R(q)\right)^{-1} + a_q^{-1}\right)^{-1} + \underbrace{\sum_{s=1}^{S}\left(R(q+s) + a_{q+s}^{-1}\right)^{-1}}_{\text{oppositional force of hyperparameters}}, \tag{2}$$

where $R$ is defined as follows:

$$R(r) = \underbrace{K\left(U_r,\; n_{r_0},\; r \cdot (m+1)\right)}_{\text{non-bottleneck component}} + \underbrace{\sum_{t \in T_r} K\left(\{t\},\; n_{r_t},\; r \cdot (m+1) + t\right)}_{\text{bottleneck components}}, \quad \text{where } U_r, T_r \subseteq \{1, \ldots, m\}, \tag{3}$$

and where $K$ is a Multivariate Broken Neural Scaling Law (MBNSL), defined as follows:

$$K(M, n, k) = b_k \cdot \left(\prod_{i \in M} x_i^{-c_{i_{0_k}}}\right) \prod_{j=1}^{n} \left(1 + \left(\frac{\prod_{i \in M} x_i^{c_{i_{j_k}}}}{d_{j_k}}\right)^{\left|\frac{1}{f_{j_k}}\right|}\right)^{-f_{j_k}}. \tag{4}$$
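Equation 4 can be transcribed into a few lines of code. The following is a minimal sketch for a single fixed index $k$, not the authors' implementation (which the paper places in its appendix); the function names, the dictionary-based argument layout, and all constants are assumptions made for illustration.

```python
import numpy as np

def _power_product(x, exponents, M):
    """prod_{i in M} x_i ** exponents[i], with NumPy broadcasting."""
    out = 1.0
    for i in M:
        out = out * np.asarray(x[i], dtype=float) ** exponents[i]
    return out

def mbnsl(x, M, b, c0, breaks=()):
    """Evaluate Equation 4 for one fixed index k.

    x      : dict {dimension index i: positive value or array x_i}
    M      : iterable of dimension indices entering the products
    b      : offset constant b_k
    c0     : dict {i: c_{i_0}} of initial exponents
    breaks : sequence of (c_j, d_j, f_j) tuples, one per hyperbreak,
             where c_j is a dict {i: c_{i_j}}
    """
    M = list(M)
    # Leading factor: b_k * prod_i x_i^(-c_{i_0}).
    y = b * _power_product(x, {i: -c0[i] for i in M}, M)
    # One multiplicative factor per hyperbreak j = 1..n.
    for c_j, d_j, f_j in breaks:
        u = _power_product(x, c_j, M) / d_j
        y = y * (1.0 + u ** abs(1.0 / f_j)) ** (-f_j)
    return y

# Illustrative constants only: x_1 varies, x_2 is held fixed, one hyperbreak.
x = {1: np.array([1e3, 1e6, 1e9]), 2: 1e4}
y = mbnsl(x, M=[1, 2], b=5.0, c0={1: 0.1, 2: 0.2},
          breaks=[({1: 0.3, 2: 0.0}, 1e2, 0.5)])
print(y)
```

Passing an empty `breaks` sequence gives the $n = 0$ case, which is a plain product of power laws.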
The parameters whose values are unknown constants that must be estimated by fitting the above functional form to the ($x_1 \ldots x_m, y$) data points are all those whose base is one of these: $a, b, c, d, f$.
The purpose of the variables $i$, $j$, $k$, $q$, $r$, $s$, $t$ is indexation. $n$ is a bound of a product operator; as a result, each of $n_{r_0}$ and $n_{r_t}$ implicitly is a bound of a product operator. $S$ is a bound of a summation operator. $M \subseteq \{1, \ldots, m\}$. $M$ is a product index set; as a result, $U_r$ implicitly is a product index set. $T_r$ is a summation index set. $K, Q, R$ are functions and the contents of the parentheses in $K(\cdot), Q(\cdot), R(\cdot)$ are arguments of those functions. Whenever an argument of $K$, $Q$, or $R$ is obtained via addition(s) and/or multiplication(s), the sole reason that those additions and multiplications occur is to cause each instantiation of $K$ to have a unique value for $k$.
Equations [1](https://arxiv.org/html/2605.26248#S2.E1), [2](https://arxiv.org/html/2605.26248#S2.E2), [3](https://arxiv.org/html/2605.26248#S2.E3), and [4](https://arxiv.org/html/2605.26248#S2.E4) are interpreted as follows.
We use the term multi-log space to refer to the $(m+1)$-dimensional space obtained by applying the logarithmic transformation to every dimension ($x_1 \ldots x_m, y$).
Equation [4](https://arxiv.org/html/2605.26248#S2.E4) is an extension of the univariate broken neural scaling law (BNSL) of Caballero et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib2)) to multivariate settings. When $|M| = 1$, its expressivity is identical to the univariate broken neural scaling law functional form (with the performance limit term subtracted out) from Caballero et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib2)). When $|M| > 1$, Equation [4](https://arxiv.org/html/2605.26248#S2.E4) defines a sequence of $n+1$ smoothly connected hyperplanes in multi-log space. Constant $n$ corresponds to the number of (smooth) “hyperbreaks” (i.e., transitions) between $n+1$ consecutive hyperplanes in multi-log space; the dimensionality of each hyperplane is $|M|$, and the dimensionality of each hyperbreak is $|M| - 1$. When $n = 0$, Equation [4](https://arxiv.org/html/2605.26248#S2.E4) becomes $b_k \prod_{i \in M} x_i^{-c_{i_{0_k}}}$. In multi-log space, the initial exponent for each input dimension $(c_{i_{0_k}})_{i \in M}$ corresponds to the gradient of the first hyperplane with respect to the input dimensions $(x_i)_{i \in M}$. In multi-log space, $b_k$ corresponds to the offset of the output of Equation [4](https://arxiv.org/html/2605.26248#S2.E4). The $j$-th hyperplane smoothly transitions to the $(j+1)$-th hyperplane at the values of $(x_i)_{i \in M}$ for which this equality is true: $d_{j_k} = \prod_{i \in M} x_i^{c_{i_{j_k}}}$. The $j$-th exponent for each input dimension $(c_{i_{j_k}})_{i \in M}$ multiplied by $\operatorname{sign}(f_{j_k})$ corresponds to the change in gradient (with respect to the input dimensions $(x_i)_{i \in M}$) between the $j$-th hyperplane and the $(j+1)$-th hyperplane in multi-log space.
Constant $f_{j_k}$ represents the sharpness of the hyperbreak between the $j$-th and the $(j+1)$-th hyperplane in multi-log space; smaller values of $|f_{j_k}|$ yield a sharper hyperbreak and regions (before and after the $j$-th hyperbreak) that have less curvature in multi-log space; larger values of $|f_{j_k}|$ yield a smoother (wider) hyperbreak and regions (before and after the $j$-th hyperbreak) that have more curvature in multi-log space.
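The roles of $c$, $d$, and $f$ described above can be read off from the two asymptotic regimes of a single-hyperbreak ($n = 1$) instance of Equation 4. The short derivation below is a sketch: the subscript $k$ is dropped for readability, and the shorthand $u$ is introduced here and is not part of the original notation.

```latex
% Shorthand (an assumption of this sketch): u = \prod_{i \in M} x_i^{c_{i_1}} / d_1
\begin{align*}
K &= b \prod_{i \in M} x_i^{-c_{i_0}} \left(1 + u^{|1/f_1|}\right)^{-f_1} \\[4pt]
u \ll 1: \quad
\log K &\approx \log b - \sum_{i \in M} c_{i_0} \log x_i \\[4pt]
u \gg 1: \quad
\left(1 + u^{|1/f_1|}\right)^{-f_1} &\approx u^{-f_1/|f_1|} = u^{-\operatorname{sign}(f_1)} \\
\log K &\approx \log b + \operatorname{sign}(f_1) \log d_1
  - \sum_{i \in M} \left(c_{i_0} + \operatorname{sign}(f_1)\, c_{i_1}\right) \log x_i
\end{align*}
```

Both regimes are linear in $(\log x_i)_{i \in M}$, i.e., hyperplanes in multi-log space. The decay exponent along $x_i$ moves from $c_{i_0}$ to $c_{i_0} + \operatorname{sign}(f_1)\, c_{i_1}$, and the crossover sits at $u = 1$, which is the condition $d_1 = \prod_{i \in M} x_i^{c_{i_1}}$ stated above. The magnitude $|f_1|$ does not appear in either limit; it only controls how quickly the expression moves between them.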
Equation [3](https://arxiv.org/html/2605.26248#S2.E3) consists of 2 kinds of components. The component $K(U_r,\, n_{r_0},\, r \cdot (m+1))$ is referred to as a “non-bottleneck” component and corresponds to the smoothly connected hyperplanes (in multi-log space) as described in the previous paragraph. Each of the components summed together in the summation $\sum_{t \in T_r} K(\{t\},\, n_{r_t},\, r \cdot (m+1) + t)$ is referred to as a “bottleneck” component and corresponds to each of the performance limits when bottlenecked by each of the dimensions $(x_t)_{t \in T_r}$.
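To make the two kinds of components concrete, the sketch below evaluates Equation 3 in the simplest setting, where every $K$ has $n = 0$ and therefore reduces to a product of power laws. That restriction, the function names, and all constants are assumptions chosen for illustration; they are not fitted values from the paper.

```python
import numpy as np

def power_law(x, idx, b, c):
    """n = 0 case of Equation 4: b * prod_{i in idx} x_i^(-c_i)."""
    out = b
    for i in idx:
        out = out * np.asarray(x[i], dtype=float) ** (-c[i])
    return out

def r_component(x, U, T, nonbottleneck, bottlenecks):
    """Equation 3 with every K restricted to n = 0 (an assumption of this sketch).

    nonbottleneck : (b, {i: c_i}) for the joint term over the index set U
    bottlenecks   : {t: (b_t, c_t)}, one single-dimension term per t in T
    """
    b, c = nonbottleneck
    total = power_law(x, U, b, c)
    for t in T:
        b_t, c_t = bottlenecks[t]
        total = total + power_law(x, [t], b_t, {t: c_t})
    return total

# Illustrative setup: x_1 is held fixed while x_2 grows.
x = {1: 1e8, 2: np.array([1e6, 1e9, 1e12, 1e15])}
r = r_component(x, U=[1, 2], T=[1, 2],
                nonbottleneck=(50.0, {1: 0.2, 2: 0.2}),
                bottlenecks={1: (10.0, 0.3), 2: (20.0, 0.3)})

# Value of the x_1 bottleneck term alone, which r approaches as x_2 grows.
x1_limit = 10.0 * 1e8 ** (-0.3)
print(r, x1_limit)
```

As $x_2$ grows, both the joint term and the $x_2$ bottleneck term shrink, and what remains is the $x_1$ bottleneck term: a limit that depends on $x_1$ alone.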
Equation [2](https://arxiv.org/html/2605.26248#S2.E2) is interpreted as follows. $R(q)$ represents everything that has been discussed thus far in this Section [2](https://arxiv.org/html/2605.26248#S2); $a_q$ represents a misperformance limit (e.g., the cross-entropy or test error rate of random guessing). The remaining contents of Equation [2](https://arxiv.org/html/2605.26248#S2.E2) represent the “oppositional force” of hyperparameters (such as learning rate and standard deviation of weights at initialization) that have an oppositional relationship with the performance evaluation metric; for example, when learning rate and/or standard deviation of weights at initialization are too large, they exert an “oppositional force” on the value of $Q(q)$. $S$ represents the number of misperformance limits of the “oppositional force” of hyperparameters; $S$ does not represent any other quantities (e.g., $S$ does not represent the number of hyperparameters). In practice, $S \leq 1$ except in relatively contrived scenarios (e.g., scenarios in which it is simultaneously true that the number of training steps is very small (e.g., smaller than 5 steps) and the learning rate is a value greater than 1), such as the scaling behavior shown in Figure [9](https://arxiv.org/html/2605.26248#S17.F9) of Appendix [17.5](https://arxiv.org/html/2605.26248#S17.SS5).
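The reciprocal structure of Equation 2 is easier to see with numbers. The sketch below takes already-evaluated values of $R$ as plain inputs; the helper names `saturate` and `q_term` and the example constants are assumptions introduced for illustration.

```python
def saturate(value, limit):
    """(value^-1 + limit^-1)^-1.

    Close to `value` when value << limit; close to `limit` when value >> limit.
    """
    return 1.0 / (1.0 / value + 1.0 / limit)

def q_term(r_values, a_values):
    """Equation 2 for one q, for any S >= 0.

    r_values : [R(q), R(q+1), ..., R(q+S)], all positive
    a_values : [a_q, a_{q+1}, ..., a_{q+S}], all positive
    """
    # First term: R(q) capped by the misperformance limit a_q.
    main = saturate(r_values[0], a_values[0])
    # Remaining S terms: each is near 0 when R(q+s) is large and
    # approaches a_{q+s} as R(q+s) goes to 0.
    opposition = sum(1.0 / (r + 1.0 / a)
                     for r, a in zip(r_values[1:], a_values[1:]))
    return main + opposition

# S = 0: a very large R(q) is capped at the misperformance limit a_q.
print(q_term([1e12], [2.3]))
# S = 1: a very small R(q+1) switches the oppositional term on, adding about a_{q+1}.
print(q_term([0.5, 1e-12], [2.3, 4.0]))
```

Read this way, each $a$ in Equation 2 acts as a ceiling: $a_q$ bounds the first term, and each $a_{q+s}$ bounds the size of the corresponding oppositional term.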
Equation [1](https://arxiv.org/html/2605.26248#S2.E1) is interpreted as follows. $Q(3)$ represents everything that has been discussed thus far in this Section [2](https://arxiv.org/html/2605.26248#S2). The constant $a_0$ corresponds to the limit as to how far the value of $y$ can be reduced (or maximized) even if all of $x_1 \ldots x_m$ go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the global optimum of $y$. The constant $a_2$ corresponds to a misperformance limit that is caused by the particular performance evaluation metric that is used. For example, when using a performance evaluation metric (such as cross-entropy) that is unbounded above, $a_2 = \infty$ (i.e., $a_2^{-1} = 0$); and when using a performance evaluation metric (such as error rate) that is bounded above, $a_2 < \infty$. The remaining contents of Equation [1](https://arxiv.org/html/2605.26248#S2.E1), i.e., the inner reciprocal $\left(Q(S+4) + a_1^{-1}\right)^{-1}$, correspond to the “oppositional force” exerted by overfitting. When one trains a model for more than one epoch, this inner reciprocal becomes a non-negligible number that is considerably larger than zero.
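The outermost composition in Equation 1 can be sketched the same way, taking the two $Q$ values as already-computed numbers. The function name and the example constants are assumptions for illustration only.

```python
import math

def unsl_output(a0, a1, a2, q_main, q_overfit):
    """Equation 1, given already-evaluated Q(3) (q_main) and Q(S+4) (q_overfit)."""
    # Oppositional force of overfitting: near 0 when q_overfit is large,
    # approaching a1 as q_overfit goes to 0.
    overfit_force = 1.0 / (q_overfit + 1.0 / a1)
    # a2 = infinity (metric unbounded above) means a2^-1 = 0.
    inv_a2 = 0.0 if math.isinf(a2) else 1.0 / a2
    return a0 + 1.0 / (1.0 / (q_main + overfit_force) + inv_a2)

# Unbounded metric, negligible overfitting: y is about a0 + Q(3).
print(unsl_output(a0=1.7, a1=3.0, a2=math.inf, q_main=0.4, q_overfit=1e12))
# Bounded metric: y cannot exceed a0 + a2 however large Q(3) becomes.
print(unsl_output(a0=0.0, a1=3.0, a2=1.0, q_main=1e9, q_overfit=1e12))
# Strong overfitting (tiny Q(S+4)): the oppositional force adds about a1.
print(unsl_output(a0=0.0, a1=3.0, a2=math.inf, q_main=0.4, q_overfit=1e-12))
```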
### 2.1 The Additive Symmetry

Figure 2: An illustration of an example configuration of Equation [5](https://arxiv.org/html/2605.26248#S2.E5) with two input dimensions, $x_1$ and $x_2$. All 3 plots are of the same scaling behavior. See Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) for more details.

The following expression (in which $b$, $c_{i_0}$, $g$, and $h_i$ are constants estimated by fitting Equation [5](https://arxiv.org/html/2605.26248#S2.E5) to ($x_1 \ldots x_m, y$) data points) implicitly shows up in several places (when an addition takes place) in Equations [1](https://arxiv.org/html/2605.26248#S2.E1), [2](https://arxiv.org/html/2605.26248#S2.E2), and [3](https://arxiv.org/html/2605.26248#S2.E3):
$$y = b \cdot \left(\prod_{i=1}^{m} x_i^{-c_{i_0}}\right) + g \cdot \left(\prod_{i=1}^{m} x_i^{h_i}\right), \tag{5}$$

and is equivalent to a ($n = 1$, $M = \{1, \ldots, m\}$) version of Equation [4](https://arxiv.org/html/2605.26248#S2.E4):

$$y = b \cdot \left(\prod_{i=1}^{m} x_i^{-c_{i_0}}\right) \left(1 + \left(\frac{\prod_{i=1}^{m} x_i^{c_{i_1}}}{d}\right)^{\left|\frac{1}{f}\right|}\right)^{-f}, \tag{6}$$

when all these equalities are simultaneously true:

$$f = -1, \qquad c_{i_1} = c_{i_0} + h_i, \qquad d = b/g.$$
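The equivalence follows by substitution: with $f = -1$ the bracket in Equation 6 is raised to the power $1$, and distributing the leading factor over $1 + \prod_i x_i^{c_{i_1}}/d$ gives $b \prod_i x_i^{-c_{i_0}} + (b/d) \prod_i x_i^{c_{i_1} - c_{i_0}}$, which is Equation 5 once $b/d = g$ and $c_{i_1} - c_{i_0} = h_i$. The snippet below is a numerical sanity check of that identity at random points; all constants are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
x = rng.uniform(1.0, 1e3, size=(5, m))   # five random points, m input dimensions
b, g = 2.0, 0.5                          # illustrative positive constants
c0 = np.array([0.3, 0.1, 0.2])
h = np.array([0.4, 0.0, 0.25])

# Equation 5: sum of two power-law products.
eq5 = b * np.prod(x ** (-c0), axis=1) + g * np.prod(x ** h, axis=1)

# Equation 6 with f = -1, c_{i_1} = c_{i_0} + h_i, d = b / g.
f, c1, d = -1.0, c0 + h, b / g
u = np.prod(x ** c1, axis=1) / d
eq6 = b * np.prod(x ** (-c0), axis=1) * (1.0 + u ** abs(1.0 / f)) ** (-f)

print(np.max(np.abs(eq5 - eq6)))
```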
Equation [5](https://arxiv.org/html/2605.26248#S2.E5) is different from Equation [6](https://arxiv.org/html/2605.26248#S2.E6) in that (assuming $b$, $g$, and $d$ are positive numbers):
1. For Equation [5](https://arxiv.org/html/2605.26248#S2.E5), the change in gradient (with respect to the input dimensions $x_1, \ldots, x_m$ as any $x_i$ increases) between the 1st hyperplane and the 2nd hyperplane in multi-log space is always nonnegative; meanwhile, for Equation [6](https://arxiv.org/html/2605.26248#S2.E6), this change in gradient can be any amount.
2. For Equation [5](https://arxiv.org/html/2605.26248#S2.E5), the sharpness of the hyperbreak between the 1st and the 2nd hyperplane in multi-log space is dependent solely on the amount of change in gradient between the 1st hyperplane and the 2nd hyperplane in multi-log space; meanwhile, for Equation [6](https://arxiv.org/html/2605.26248#S2.E6), this sharpness is dependent on the value of $f$ (and as a result is decoupled from the amount of change in gradient between the 1st hyperplane and the 2nd hyperplane in multi-log space).
Empirically, we observe that nonmonotonic transitions always seem to be characterized by Equation [5](https://arxiv.org/html/2605.26248#S2.E5) rather than Equation [6](https://arxiv.org/html/2605.26248#S2.E6). As a result, (when an addition takes place in the center) in Equations [1](https://arxiv.org/html/2605.26248#S2.E1) and [2](https://arxiv.org/html/2605.26248#S2.E2), we implicitly use Equation [5](https://arxiv.org/html/2605.26248#S2.E5) to model phenomena (e.g., overfitting and hyperparameters such as learning rate and standard deviation of weights at initialization) that are capable of exhibiting a nonmonotonic relationship with the performance evaluation metric.
Empirically, we observe that transitions to or from regions in which the gradient (with respect to at least one of the input dimensions $x_1, \ldots, x_m$) is equal to zero always seem to be characterized by a version of Equation [5](https://arxiv.org/html/2605.26248#S2.E5) in which $h_i = 0$ for every $x_i$ (in $x_1, \ldots, x_m$) with respect to which the gradient is equal to zero. As a result, we implicitly use that version of Equation [5](https://arxiv.org/html/2605.26248#S2.E5) when addition takes place in Equation [3](https://arxiv.org/html/2605.26248#S2.E3) and when addition takes place with a parameter whose base is $a$ in parts of Equations [1](https://arxiv.org/html/2605.26248#S2.E1) and [2](https://arxiv.org/html/2605.26248#S2.E2).
Note that Equation [5](https://arxiv.org/html/2605.26248#S2.E5) sums two $n = 0$ versions of the MBNSL of Equation [4](https://arxiv.org/html/2605.26248#S2.E4). To extend the relations discussed in this Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) thus far to a summation of two MBNSLs that each have an arbitrary number of hyperbreaks $n$, see Appendix [7](https://arxiv.org/html/2605.26248#S7).
### 2.2 Desiderata
The UNSL functional form satisfies all of the following desiderata:
1. Each univariate scaling behavior is a univariate broken neural scaling law (BNSL) of Caballero et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib2)). This means that (as discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1)) for a significant subset of transitions between consecutive hyperplanes (in multi-log space) the sharpness needs to be decoupled from the amount of change in gradient (i.e., via the extra expressivity granted by $f$ in Equation [6](https://arxiv.org/html/2605.26248#S2.E6) (and Equation [4](https://arxiv.org/html/2605.26248#S2.E4))).
2. The positions of break(s) (within univariate scaling behaviors) within hyperbreak(s) created by non-bottleneck components are shifted via multiplication in a way that is dependent on other input dimensions.
3. Whenever all but one $x_i$ dimension in $x_1 \ldots x_m$ simultaneously go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the global optimum of $y$, that performance limit is dependent on the value of that single $x_i$ dimension (that is bottlenecking performance) and no other dimension in $x_1 \ldots x_m$. When sufficiently close to the global optimum of $y$, the transition to that performance limit is characterized by the functional form $y = a + \sum_{t \in T_r} b_t \cdot x_t^{-c_t}$.
4. The performance limit as all $x_i$ dimensions in $x_1 \ldots x_m$ simultaneously go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the global optimum of $y$ is dependent on a constant (e.g., the irreducible entropy or Bayes error). The transition to this performance limit is characterized by summing an entire functional form with a constant (e.g., $a_0$).
5. The misperformance limit (e.g., upper limits when using metrics such as error or loss for which a lower value of that metric is considered better) when the amount of misperformance is not bottlenecked by any $x_i$ in $x_1 \ldots x_m$ is dependent on a constant. The transition to this misperformance limit is characterized by raising to the $-1$ power the sum of a functional form and a constant. Examples of such misperformance limits in some scenarios are the loss or error of a random guessing (maximum entropy) model and in other scenarios are a value much larger than the loss or error of a random guessing (maximum entropy) model.
6. Whenever all but one $x_i$ dimension in $x_1 \ldots x_m$ simultaneously go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the globally worst value of $y$, that misperformance limit (e.g., upper limits when using metrics such as error or loss for which a lower value of that metric is considered better) is dependent on the value of that single $x_i$ dimension (that is bottlenecking misperformance) and no other dimension in $x_1 \ldots x_m$. When sufficiently far from the global optimum of $y$, the transition to that misperformance limit is characterized by the functional form $y = \left(a + \sum_{t \in T_r} b_t \cdot x_t^{-c_t}\right)^{-1}$. Examples of such misperformance limits are the high loss or error obtained when training dataset size is too small (i.e., such that overfitting occurs).
7. Nonmonotonic transitions (e.g., due to overfitting and hyperparameters such as learning rate and standard deviation of weights at initialization) are characterized by the additive symmetry discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1).
8. The “oppositional forces” of hyperparameters oppose “good learning” (i.e., the subset of learning that is not considered to be overfitting) and “bad learning” (i.e., the subset of learning that is considered to be overfitting).

See Appendix [16](https://arxiv.org/html/2605.26248#S16) for an explanation of how the UNSL functional form satisfies all of these desiderata. See Appendix [17](https://arxiv.org/html/2605.26248#S17) for evidence that all of these desiderata are empirically true.
## 3 Related Work
To the best of our knowledge, Rosenfeld et al. ([2019](https://arxiv.org/html/2605.26248#bib.bib45)) was the first to describe a functional form with multivariate input; this functional form is $y = a + b_1 x_1^{-c_1} + b_2 x_2^{-c_2}$, in which $x_1$ is number of model parameters and $x_2$ is training dataset size. Kaplan et al. ([2020](https://arxiv.org/html/2605.26248#bib.bib20)) (and others such as Hoffmann et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib66))) used this same functional form, but had $x_2$ be number of training steps multiplied by training batch size; we refer to this functional form as “CF”.
Muennighoff et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib10)) introduced this functional form (that we refer to as “DC”) with trivariate input:

$$y = a + b_1 \cdot \left(U_N + U_N \cdot d_1 \cdot \left(1 - e^{-R_N/d_1}\right)\right)^{-c_1} + b_2 \cdot \left(x_3 + x_3 \cdot d_2 \cdot \left(1 - e^{-R_D/d_2}\right)\right)^{-c_2};$$

in that functional form:

$$R_D = \max\left(0, (x_2/x_3) - 1\right),$$

$$U_N = \min\left(x_1,\; \left(x_3 \cdot \left(\frac{c_1 \cdot b_1}{c_2 \cdot b_2}\right)^{1/(c_1+c_2)}\right)^{c_2/c_1} \cdot \left(\frac{c_1 \cdot b_1}{c_2 \cdot b_2}\right)^{1/(c_1+c_2)}\right),$$

$$R_N = \max\left(0, (x_1/U_N) - 1\right),$$

$x_1$ is number of model parameters, $x_2$ is number of training steps multiplied by training batch size, and $x_3$ is training dataset size. When training dataset size is so large that one only trains for one epoch, functional form “DC” is mathematically identical to functional form “CF”.
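The two baseline forms above are short enough to transcribe directly. The sketch below follows the formulas exactly as written in this section; the function names and every constant are illustrative assumptions, not fitted values from any of the cited papers. It also checks the stated single-epoch reduction of “DC” to “CF”.

```python
import math

def cf(x1, x2, a, b1, c1, b2, c2):
    """'CF' form: y = a + b1 * x1^(-c1) + b2 * x2^(-c2)."""
    return a + b1 * x1 ** (-c1) + b2 * x2 ** (-c2)

def dc(x1, x2, x3, a, b1, c1, b2, c2, d1, d2):
    """'DC' form as written above.

    x1: model parameters, x2: training steps * batch size, x3: dataset size.
    """
    ratio = ((c1 * b1) / (c2 * b2)) ** (1.0 / (c1 + c2))
    u_n = min(x1, (x3 * ratio) ** (c2 / c1) * ratio)
    r_n = max(0.0, x1 / u_n - 1.0)
    r_d = max(0.0, x2 / x3 - 1.0)
    effective_params = u_n + u_n * d1 * (1.0 - math.exp(-r_n / d1))
    effective_data = x3 + x3 * d2 * (1.0 - math.exp(-r_d / d2))
    return a + b1 * effective_params ** (-c1) + b2 * effective_data ** (-c2)

# Illustrative constants only.
consts = dict(a=1.0, b1=100.0, c1=0.3, b2=200.0, c2=0.3)

# One epoch on a very large dataset (x2 == x3): DC coincides with CF.
one_epoch_dc = dc(1e8, 1e12, 1e12, d1=5.0, d2=15.0, **consts)
one_epoch_cf = cf(1e8, 1e12, **consts)

# Four epochs on a smaller dataset: repeated data counts for less than fresh data.
repeated_dc = dc(1e8, 4e9, 1e9, d1=5.0, d2=15.0, **consts)
repeated_cf = cf(1e8, 4e9, **consts)
print(one_epoch_dc, one_epoch_cf, repeated_dc, repeated_cf)
```

In the single-epoch case both $R_N$ and $R_D$ are zero, so the exponential terms vanish and the expression collapses to “CF” with $x_2 = x_3$.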
See Appendix [19](https://arxiv.org/html/2605.26248#S19) for additional related work.
## 4 Empirical Results: Fits & Extrapolations of Functional Forms
We now show the fits and extrapolations of various functional forms. In all plots here, onward, and in the appendix, triangle-shaped points are points used for fitting a functional form, circle-shaped points are held-out points used for evaluating extrapolation of the functional form fit to the triangle-shaped points, and lines are the functional form that has been fit to the triangle-shaped points. The color of each line and (the inside of) each point represents its value along the color bar dimension. Lines of the functional form are intentionally only rendered at the values of the color bar dimension for which there exists at least one (triangle-shaped or circle-shaped) point; this means that the vertical distance of each point from the line (that is the same color as that point) represents the error of the functional form when fitting (or extrapolating to) that point. All of the plots in this paper here, onward, and in the appendix contain circle-shaped point(s) for evaluating extrapolation.
See Appendix [10](https://arxiv.org/html/2605.26248#S10) for details on fitting UNSL. See Appendix [20](https://arxiv.org/html/2605.26248#S20) for code that implements UNSL. See Appendix [13](https://arxiv.org/html/2605.26248#S13) for an analysis of how the number of observed points used for fitting affects extrapolation accuracy. See Appendix [14](https://arxiv.org/html/2605.26248#S14) for an example of UNSL accurately extrapolating to scales an order of magnitude larger in multiple dimensions simultaneously. See Appendix [12](https://arxiv.org/html/2605.26248#S12) for how to obtain the compute-optimal values of the input dimensions from a fitted UNSL.
All the extrapolation evaluations reported in the tables (that have the $\downarrow$ symbol in the top row) are reported in terms of root mean squared log error (RMSLE) ± root standard log error. See Appendix [8](https://arxiv.org/html/2605.26248#S8) for the definition of RMSLE and Appendix [9](https://arxiv.org/html/2605.26248#S9) for the definition of root standard log error.
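For orientation, the sketch below computes RMSLE under the assumption that it follows the usual natural-log definition, $\sqrt{\operatorname{mean}\left((\log \hat{y} - \log y)^2\right)}$, over the held-out points. The paper's exact definition lives in its Appendix 8 and is not reproduced here, so treat this as an approximation of the metric rather than the authors' code; root standard log error is deliberately omitted because its definition is not given in this section.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, assuming the common natural-log definition.

    Both inputs must be strictly positive.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical held-out values and extrapolated predictions.
held_out = [0.52, 0.41, 0.33]
predicted = [0.50, 0.43, 0.30]
print(rmsle(held_out, predicted))
```

Because the error is taken in log space, a prediction that is off by a fixed multiplicative factor contributes the same amount regardless of the scale of $y$.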
#### 4.0.1 Ablation functional forms
A1 functional form refers to the baseline ablation functional form in which all the additive symmetries discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) have been removed such that this A1 baseline functional form is Equation [4](https://arxiv.org/html/2605.26248#S2.E4), i.e.:

$$y = b \cdot \left(\prod_{i=1}^{m} x_i^{-c_{i_0}}\right) \prod_{j=1}^{n} \left(1 + \left(\frac{\prod_{i=1}^{m} x_i^{c_{i_j}}}{d_j}\right)^{\left|\frac{1}{f_j}\right|}\right)^{-f_j}.$$

A2 functional form refers to the baseline ablation functional form that consists solely of Equation [3](https://arxiv.org/html/2605.26248#S2.E3) (which consists of Equation [4](https://arxiv.org/html/2605.26248#S2.E4)) plus the constant $a_0$ (which corresponds to the limit as to how far the value of $y$ can be reduced (or maximized) even if all of $x_1 \ldots x_m$ go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the global optimum of $y$):

$$y = a_0 + K\left(U_0,\; n_{0_0},\; 0\right) + \sum_{t \in T_0} K\left(\{t\},\; n_{0_t},\; t\right), \quad \text{where } U_0, T_0 \subseteq \{1, \ldots, m\}.$$

A2 functional form incorporates more of the additive symmetries discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) than A1 functional form does.
A3 functional form refers to the baseline ablation functional form that consists solely of Equation [2](https://arxiv.org/html/2605.26248#S2.E2) (which consists of Equation [3](https://arxiv.org/html/2605.26248#S2.E3) (which consists of Equation [4](https://arxiv.org/html/2605.26248#S2.E4))) plus the constant $a_0$ (which corresponds to the limit as to how far the value of $y$ can be reduced (or maximized) even if all of $x_1 \ldots x_m$ go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the global optimum of $y$):

$$y = a_0 + \left(\left(\left(\left(R(0)\right)^{-1} + a_1^{-1}\right)^{-1} + \sum_{s=1}^{S} \left(R(s) + a_{s+2}^{-1}\right)^{-1}\right)^{-1} + a_2^{-1}\right)^{-1}.$$

A3 functional form incorporates more of the additive symmetries discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) than A2 functional form does. UNSL functional form incorporates all of the additive symmetries discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) (i.e., more than A3 functional form does).
#### 4.0.2 Summary of Results
Table 1: Percentage of tasks by domain where each functional form is the best for extrapolation of scaling behavior. See Sections [4.1](https://arxiv.org/html/2605.26248#S4.SS1) and [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.
A1, A2, A3, and UNSL all have the exact same supremal expressivity. As a result, the fact that UNSL is better for extrapolation than A1, A2, and A3 in Table [1](https://arxiv.org/html/2605.26248#S4.T1) is due to the fact that UNSL enforces more of the desiderata (of Section [2.2](https://arxiv.org/html/2605.26248#S2.SS2)) (e.g., via incorporating all of the symmetries discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1)) than A1, A2, and A3 do.

Figure 3:UNSL accurately Extrapolating Downstream Performance; there are many additional accurate extrapolation results in Appendix[18](https://arxiv.org/html/2605.26248#S18)\. Experimental data of scaling behavior in left plot is downstream performance on CSR \(Common Sense Reasoning\), i\.e\. downstream zero\-shot mean test error rate across HellaSwag, ARC \(easy and challenge\), PIQA, WinoGrande, OpenBookQA, SIQA, and BoolQ; see Section[4\.2](https://arxiv.org/html/2605.26248#S4.SS2)for more details\. Experimental data of scaling behavior in right plot is few\-shot downstream performance on ImageNet; see Section[4\.1](https://arxiv.org/html/2605.26248#S4.SS1)for more details\.
### 4\.1Vision
We evaluate how well various functional forms extrapolate performance on downstream vision tasks as multiple dimensions vary simultaneously\. The tasks that are evaluated are test error rate on each of various few\-shot downstream image classification tasks; the downstream tasks are: Birds 200\(Welinderet al\.,[2010](https://arxiv.org/html/2605.26248#bib.bib53)\), Cars 196\(Krauseet al\.,[2013](https://arxiv.org/html/2605.26248#bib.bib106)\), and ImageNet\(Denget al\.,[2009](https://arxiv.org/html/2605.26248#bib.bib52)\)\. The following architectures of various sizes are pre\-trained on subsets of JFT\-300M\(Sunet al\.,[2017](https://arxiv.org/html/2605.26248#bib.bib56)\): vision transformers \(ViT\)\(Dosovitskiyet al\.,[2020](https://arxiv.org/html/2605.26248#bib.bib58)\), MLP mixers \(MiX\)\(Tolstikhinet al\.,[2021](https://arxiv.org/html/2605.26248#bib.bib59)\), and big\-transfer residual neural networks \(BiT\)\(Kolesnikovet al\.,[2020](https://arxiv.org/html/2605.26248#bib.bib57)\)\. The bivariate subset of this scaling behavior data is obtained via correspondence with authors ofAlabdulmohsinet al\.\([2022](https://arxiv.org/html/2605.26248#bib.bib51)\); the simultaneously varying dimensions of the bivariate scaling behavior are training dataset size and number of training steps\. The trivariate subset of this scaling behavior data is obtained from the ViT/16 results ofZhaiet al\.\([2022](https://arxiv.org/html/2605.26248#bib.bib116)\); the simultaneously varying dimensions of the trivariate scaling behavior are training dataset size, number of training steps, and number of model parameters\. 
As can be seen in Tables [1](https://arxiv.org/html/2605.26248#S4.T1), [2](https://arxiv.org/html/2605.26248#S4.T2), and [3](https://arxiv.org/html/2605.26248#S4.T3), UNSL yields the extrapolations with the lowest RMSLE (Root Mean Squared Logarithmic Error) on 60.87% of tasks, more than any other functional form; the next best functional form is the best on only 21.74% of the tasks. To view plots of UNSL, DC, A1, A2, and A3 on each of these bivariate scaling behaviors, see Figures [17](https://arxiv.org/html/2605.26248#S18.F17), [18](https://arxiv.org/html/2605.26248#S18.F18), [19](https://arxiv.org/html/2605.26248#S18.F19), [20](https://arxiv.org/html/2605.26248#S18.F20), and [21](https://arxiv.org/html/2605.26248#S18.F21), respectively, in Appendix [18.5](https://arxiv.org/html/2605.26248#S18.SS5). To view plots of UNSL, DC, A1, A2, and A3 on each of these trivariate scaling behaviors, see Figures [22](https://arxiv.org/html/2605.26248#S18.F22), [23](https://arxiv.org/html/2605.26248#S18.F23), [24](https://arxiv.org/html/2605.26248#S18.F24), [25](https://arxiv.org/html/2605.26248#S18.F25), and [26](https://arxiv.org/html/2605.26248#S18.F26), respectively, in Appendix [18.5](https://arxiv.org/html/2605.26248#S18.SS5).
In Appendix[18\.1](https://arxiv.org/html/2605.26248#S18.SS1), we additionally show that UNSL accurately extrapolates the multivariate scaling behavior of reinforcement learning\.
In Appendix[18\.3](https://arxiv.org/html/2605.26248#S18.SS3), we additionally show that UNSL accurately extrapolates multivariate scaling behavior as width and depth vary simultaneously\.
In Appendix[18\.4](https://arxiv.org/html/2605.26248#S18.SS4), we additionally show that UNSL accurately extrapolates multivariate scaling behavior when batch size is an input dimension to UNSL\.
In Figure[11](https://arxiv.org/html/2605.26248#S17.F11)of Appendix[17\.7](https://arxiv.org/html/2605.26248#S17.SS7), we additionally show that UNSL accurately extrapolates the trivariate scaling behavior as learning rate, standard deviation of weights at initialization, and number of training steps all vary simultaneously\.
Table 2: Extrapolation Results for trivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.
Table 3: Extrapolation Results for bivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.
### 4\.2Language
We evaluate how well various functional forms extrapolate performance on downstream (and upstream) language tasks as multiple dimensions vary simultaneously. As can be seen in Tables [1](https://arxiv.org/html/2605.26248#S4.T1), [4](https://arxiv.org/html/2605.26248#S4.T4), and [5](https://arxiv.org/html/2605.26248#S4.T5), UNSL yields the extrapolations with the lowest RMSLE (Root Mean Squared Logarithmic Error) on 88.89% of tasks, more than any other functional form; the next best functional form is the best on only 11.11% of the tasks. To view plots of UNSL, DC, A1, A2, and A3 on trivariate scaling behavior, see Figures [27](https://arxiv.org/html/2605.26248#S18.F27), [28](https://arxiv.org/html/2605.26248#S18.F28), [29](https://arxiv.org/html/2605.26248#S18.F29), [30](https://arxiv.org/html/2605.26248#S18.F30), and [31](https://arxiv.org/html/2605.26248#S18.F31), respectively, in Appendix [18.6.1](https://arxiv.org/html/2605.26248#S18.SS6.SSS1); this trivariate scaling behavior data is from scaling behavior data released by Muennighoff et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib10)), and the simultaneously varying dimensions of these trivariate scaling behaviors are number of model parameters, number of tokens processed, and number of tokens in training dataset. To view plots of UNSL, CF, A1, and A2 on each of these bivariate scaling behaviors, see Figures [32](https://arxiv.org/html/2605.26248#S18.F32), [33](https://arxiv.org/html/2605.26248#S18.F33), [34](https://arxiv.org/html/2605.26248#S18.F34), and [35](https://arxiv.org/html/2605.26248#S18.F35), respectively, in Appendix [18.6.2](https://arxiv.org/html/2605.26248#S18.SS6.SSS2); the simultaneously varying dimensions of these bivariate scaling behaviors are number of model parameters and number of training steps (or number of tokens processed).
There is no A3 in Table[5](https://arxiv.org/html/2605.26248#S4.T5)because UNSL becomes A3 in the scenario of Table[5](https://arxiv.org/html/2605.26248#S4.T5), i\.e\. the scenario in which training dataset size is effectively infinite such that one only trains for one epoch\. The bivariate scaling behaviors that are referred to as "constant" are obtained from the LLaMA and HGRN2 portions of Figures 1 and 2 ofShenet al\.\([2024](https://arxiv.org/html/2605.26248#bib.bib103)\); they are referred to as "constant" because the learning rate is held constant and a learning rate schedule is not used\. The bivariate scaling behaviors that are referred to as "chinchilla" are obtained via correspondence with authors ofHoffmannet al\.\([2022](https://arxiv.org/html/2605.26248#bib.bib66)\); they are called "chinchilla" because they use "chinchilla\-scaling" \(i\.e\. a learning rate schedule that is chosen to be training compute optimal as inHoffmannet al\.\([2022](https://arxiv.org/html/2605.26248#bib.bib66)\)\) and are the scaling behavior data fromHoffmannet al\.\([2022](https://arxiv.org/html/2605.26248#bib.bib66)\)\. CSR \(Common Sense Reasoning\) is zero\-shot mean test error rate across HellaSwag\(Zellerset al\.,[2019](https://arxiv.org/html/2605.26248#bib.bib110)\), ARC \(easy and challenge\)\(Clarket al\.,[2018](https://arxiv.org/html/2605.26248#bib.bib112)\), PIQA\(Bisket al\.,[2020](https://arxiv.org/html/2605.26248#bib.bib108)\), WinoGrande\(Sakaguchiet al\.,[2020](https://arxiv.org/html/2605.26248#bib.bib111)\), OpenBookQA\(Mihaylovet al\.,[2018](https://arxiv.org/html/2605.26248#bib.bib113)\), SIQA\(Sapet al\.,[2019](https://arxiv.org/html/2605.26248#bib.bib109)\), and BoolQ\(Clarket al\.,[2019](https://arxiv.org/html/2605.26248#bib.bib107)\)\.
In Appendix[18\.2](https://arxiv.org/html/2605.26248#S18.SS2), we additionally show that UNSL accurately extrapolates the multivariate scaling behavior of inference \(i\.e\. test\-time\) scaling\.
Table 4: Extrapolation Results for trivariate scaling behavior of language performance. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.
Table 5: Extrapolation Results for bivariate scaling behavior of downstream (and upstream) language performance. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.
## 5The Limit of the Predictability of Scaling Behavior


Figure 4: Extrapolation of UNSL on scaling behavior of an MLP trained for a single epoch on the (n, k)-sparse parity task (with $n=40$ and $k=4$) of Barak et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib105)). In the left plot, each point is the mean of greater than 100 seeds of an MLP trained for a single epoch on that task. In the right plot, each point is gathered from a noiseless simulation of the UNSL of the scaling behavior of that (n, k)-sparse parity task. See Section [5](https://arxiv.org/html/2605.26248#S5) and Appendix [11](https://arxiv.org/html/2605.26248#S11) for more details.
We use UNSL to glean insights about the limit of the predictability of scaling behavior. In Figure [4](https://arxiv.org/html/2605.26248#S5.F4) (left), UNSL accurately extrapolates the scaling behavior of the sparse parity task of Barak et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib105)), despite the fact that this task famously does not exhibit any observable progress in loss (nor error) for the first few hundred training steps. In Figure [4](https://arxiv.org/html/2605.26248#S5.F4) (right), we use a noiseless simulation of the UNSL of the scaling behavior of the sparse parity task to show what would happen if one had infinitely many training runs / seeds to average out all the noisy deviation between runs, such that one could recover (i.e. learn via curve-fitting) the learned constants of the UNSL as well as possible. We observe the following:
- To accurately extrapolate past each hyperbreak, the shortest distance to each hyperbreak from (the convex hull of) the points used for fitting must be sufficiently small.
## 6Discussion
We have presented the unified neural scaling law (UNSL) functional form that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.
#### Acknowledgments
We are thankful for useful feedback and assistance from Ben Adlam, Ibrahim Alabdulmohsin, Sebastian Borgeaud, Kevin Clark and others\.
## Appendix
## 7Extension of the Additive Symmetry relations discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) to a summation of two MBNSLs that each have an arbitrary number of hyperbreaks $n$
Note that Equation [5](https://arxiv.org/html/2605.26248#S2.E5) sums two $n=0$ versions of MBNSL of Equation [4](https://arxiv.org/html/2605.26248#S2.E4). To extend the relations discussed in Section [2.1](https://arxiv.org/html/2605.26248#S2.SS1) to a summation of two MBNSLs that each have an arbitrary number of hyperbreaks $n$, for each of those two MBNSLs one needs to obtain the $n=0$ version $\left(w_b \prod_{i=1}^{m} x_i^{w_{c_i}}\right)$ of MBNSL of Equation [4](https://arxiv.org/html/2605.26248#S2.E4) that is the tangent hyperplane in multi-log space. The values of $w_b$ and $w_{c_i}$ that yield the tangent hyperplane in multi-log space are:
$$w_{c_i} = -c_{i_0} - \sum_{j=1}^{n} \operatorname{sign}(f_j) \cdot c_{i_j} \cdot \left(1 + \left(\frac{\prod_{l=1}^{m} x_l^{c_{l_j}}}{d_j}\right)^{-\left|\frac{1}{f_j}\right|}\right)^{-1},$$
$$w_b = b \cdot \left(\prod_{l=1}^{m} x_l^{-c_{l_0}}\right) \left(\prod_{j=1}^{n} \left(1 + \left(\frac{\prod_{l=1}^{m} x_l^{c_{l_j}}}{d_j}\right)^{\left|\frac{1}{f_j}\right|}\right)^{-f_j}\right) \prod_{i=1}^{m} x_i^{-w_{c_i}}.$$
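As an illustrative sanity check (not taken from the paper's code), the tangent-hyperplane constants above can be compared against a finite-difference slope in multi-log space. The sketch below uses NumPy; the function names `mbnsl`, `tangent_monomial`, and `log_slopes` and all parameter values are hypothetical, and `mbnsl` is assumed to be the simplified form of Equation 4 with $M=\{1,\ldots,m\}$ and subscript $k$ dropped.

```python
import numpy as np

def mbnsl(x, b, c0, c, d, f):
    """Simplified Equation 4. Shapes: x, c0 -> (m,); c -> (n, m); d, f -> (n,)."""
    ratio = np.prod(x ** c, axis=1) / d  # prod_l x_l^{c_{l_j}} / d_j, one entry per hyperbreak
    return b * np.prod(x ** -c0) * np.prod((1.0 + ratio ** np.abs(1.0 / f)) ** -f)

def tangent_monomial(x, b, c0, c, d, f):
    """Return (w_b, w_c) of the n=0 form w_b * prod_i x_i^{w_c[i]} tangent at x."""
    ratio = np.prod(x ** c, axis=1) / d
    gate = 1.0 / (1.0 + ratio ** -np.abs(1.0 / f))  # weight of each hyperbreak at x
    w_c = -c0 - (np.sign(f) * gate) @ c
    w_b = mbnsl(x, b, c0, c, d, f) * np.prod(x ** -w_c)
    return w_b, w_c

def log_slopes(x, params, h=1e-6):
    """Central finite-difference estimate of d(log y) / d(log x_i) for every i."""
    slopes = []
    for i in range(len(x)):
        up, down = x.copy(), x.copy()
        up[i] *= np.exp(h)
        down[i] *= np.exp(-h)
        slopes.append((np.log(mbnsl(up, *params)) - np.log(mbnsl(down, *params))) / (2 * h))
    return np.array(slopes)

# Hypothetical parameter values (b, c0, c, d, f), chosen only for illustration.
params = (2.0, np.array([0.3, 0.1]), np.array([[0.5, 0.2], [0.1, 0.4]]),
          np.array([4.0, 2.0]), np.array([0.7, -0.5]))
x = np.array([3.0, 50.0])
w_b, w_c = tangent_monomial(x, *params)
```

Under these assumptions, `w_c` should agree with `log_slopes(x, params)` up to finite-difference error, and `w_b * np.prod(x ** w_c)` should reproduce the value of `mbnsl` at the tangent point.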
## 8Definition of Root Mean Squared Log Error
$$\text{Root\_Mean\_Squared\_Log\_Error} = \text{RMSLE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i) - \log(\hat{y}_i)\right)^2}$$
## 9Definition of Root Standard Log Error
$$\mathrm{error}_i = \left(\log(y_i) - \log(\hat{y}_i)\right)^2$$
$$\mu_{\mathrm{error}} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{error}_i$$
$$\sigma_{\mathrm{error}} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{error}_i - \mu_{\mathrm{error}}\right)^2}$$
$$\text{Root\_Standard\_Log\_Error} = \sqrt{\mu_{\mathrm{error}} + \frac{\sigma_{\mathrm{error}}}{\sqrt{\mathrm{len}(\hat{y})}}} - \sqrt{\mu_{\mathrm{error}}}$$
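The two definitions above translate directly into a few lines of NumPy. This is a minimal sketch rather than the authors' implementation; the function names `rmsle` and `root_standard_log_error` are hypothetical, and `len(y_hat)` is assumed to equal $N$.

```python
import numpy as np

def rmsle(y, y_hat):
    """Root mean squared log error between observed y and predicted y_hat."""
    squared_log_error = (np.log(y) - np.log(y_hat)) ** 2
    return np.sqrt(np.mean(squared_log_error))

def root_standard_log_error(y, y_hat):
    """sqrt(mu + sigma / sqrt(N)) - sqrt(mu), with sigma the sample std (N - 1 denominator)."""
    squared_log_error = (np.log(y) - np.log(y_hat)) ** 2
    mu = np.mean(squared_log_error)
    sigma = np.std(squared_log_error, ddof=1)
    return np.sqrt(mu + sigma / np.sqrt(len(y_hat))) - np.sqrt(mu)

# Hypothetical error rates used only to exercise the functions.
y_true = np.array([0.5, 0.4, 0.3])
y_pred = y_true * np.e  # every prediction is off by exactly one unit in log space
```

With these inputs every squared log error equals 1, so the RMSLE is 1 and the spread term of the root standard log error vanishes.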
## 10Experimental Details of Fitting UNSL
We fit the UNSL by implementing it in KFAC-JAX (Botev and Martens, [2022](https://arxiv.org/html/2605.26248#bib.bib102)) and minimizing mean squared log error (MSLE):
$$\text{MSLE} = \frac{1}{N}\sum_{i=1}^{N}\left(\log(y_i + \epsilon) - \log(\hat{y}_i + \epsilon)\right)^2, \tag{7}$$
with $\epsilon = 10^{-16}$. We also employ L2 regularization on the exponents of the UNSL with a weighting of $\lambda$ relative to the MSLE loss term.
The values of $n$ (from Equation [4](https://arxiv.org/html/2605.26248#S2.E4)), $S$ (from Equation [2](https://arxiv.org/html/2605.26248#S2.E2)), and $\lambda$ that yield the lowest extrapolation error can be obtained as follows. Split the set of observed points (i.e. the triangle-shaped points) used for fitting into a validation set and a training set such that, for every point in the validation set, the training set contains no point that is larger than that validation point in every input dimension ($x_1 \ldots x_m$) simultaneously. The values of $n$, $S$, and $\lambda$ with the lowest validation error when fitting on the remaining training points are then used. Once the values of $n$, $S$, and $\lambda$ are identified, the validation set is added back to the training set (the held-out points, i.e. the circle-shaped points, remain held out to evaluate extrapolation RMSLE). In practice, $S \leq 1$ unless the scaling behavior of interest is an extravagant scaling behavior similar to the one shown in Figure [9](https://arxiv.org/html/2605.26248#S17.F9) of Appendix [17.5](https://arxiv.org/html/2605.26248#S17.SS5).
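The constraint on the training/validation split can be checked mechanically. The following is a small sketch of such a check, not the authors' code; the function name `split_is_valid` and the example points are hypothetical, and "larger" is assumed to mean strictly larger in every input dimension.

```python
import numpy as np

def split_is_valid(train, val):
    """True if no training point exceeds some validation point in every input dimension."""
    train = np.asarray(train, dtype=float)  # shape (n_train, m)
    val = np.asarray(val, dtype=float)      # shape (n_val, m)
    # dominates[a, b] is True when training point a is larger than validation point b
    # in all m dimensions at once.
    dominates = (train[:, None, :] > val[None, :, :]).all(axis=-1)
    return not dominates.any()

# Hypothetical (x_1, x_2) points, e.g. (dataset size, training steps).
train_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [3.0, 1.0]]
val_points = [[2.0, 2.0]]
```

A point such as `[3.0, 1.0]` may stay in the training set because it is larger than the validation point in only one dimension; adding `[3.0, 3.0]` to the training set would violate the constraint.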
It takes 20000 training steps and 20 seeds of random initialization for KFAC-JAX to converge when fitting a UNSL. We use the JAX default “LeCun Normal” initialization as the distribution from which each random initialization (for each seed) of the parameters of UNSL is drawn. Unlike the values of $n$ (from Equation [4](https://arxiv.org/html/2605.26248#S2.E4)), $S$ (from Equation [2](https://arxiv.org/html/2605.26248#S2.E2)), and $\lambda$, the seed that is selected is the one that yields the lowest training error (not the lowest validation error).
## 11Experimental Details of Sections[5](https://arxiv.org/html/2605.26248#S5),[13](https://arxiv.org/html/2605.26248#S13),[17](https://arxiv.org/html/2605.26248#S17)\(besides Figure[10](https://arxiv.org/html/2605.26248#S17.F10)\), and[14](https://arxiv.org/html/2605.26248#S14)
For all figures in Sections[5](https://arxiv.org/html/2605.26248#S5),[13](https://arxiv.org/html/2605.26248#S13),[17](https://arxiv.org/html/2605.26248#S17)\(besides Figure[10](https://arxiv.org/html/2605.26248#S17.F10)\), and[14](https://arxiv.org/html/2605.26248#S14):
- The batch size is 80000. No regularization is used because training dataset size is $\sim$infinite such that the model is only trained for a single epoch. Adam is used. Adam hyperparameters are $\beta_1 = 0$ and $\beta_2 = 0$ (except for Figures [11](https://arxiv.org/html/2605.26248#S17.F11) and [12](https://arxiv.org/html/2605.26248#S17.F12) (and Table [6](https://arxiv.org/html/2605.26248#S17.T6)) in Section [17.7](https://arxiv.org/html/2605.26248#S17.SS7), in which $\beta_1 = 0.9$ and $\beta_2 = 0.999$). Except when learning rate and/or standard deviation of weights at initialization are explicitly varied in the plots of figures, learning rate and standard deviation of weights at initialization are held constant.
In Figure[4](https://arxiv.org/html/2605.26248#S5.F4), number of model parameters is varied by varying width when depth is held constant\.
## 12Obtaining the Compute\-Optimal Values of the Input Dimensions
Let $\mathcal{D}$ be the index set that contains the indexes of dimensions of $(x_1, \ldots, x_m)$ that directly contribute to the amount of training compute used (e.g. number of model parameters, number of training steps, etc.). Let $\mathcal{H}$ be the index set that contains the indexes of dimensions of $(x_1, \ldots, x_m)$ that do not directly contribute to the amount of training compute used (e.g. learning rate, standard deviation of weights at initialization, etc.). $C$ is the amount of training compute used. $C_0$ is a constant (e.g. equal to $6$ in Hoffmann et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib66))) such that $C_0 = C / \left(\prod_{i \in \mathcal{D}} x_i\right)$. $\lambda$ is a Lagrange multiplier.
To obtain the values of $(x_1, \ldots, x_m)$ that yield the lowest value of $y$ for a given value of $C$, one solves the following system of equations:
$$\frac{\partial y}{\partial x_\ell} + \lambda \frac{C}{x_\ell} = 0, \quad \ell \in \mathcal{D},$$
$$\frac{\partial y}{\partial x_v} = 0, \quad v \in \mathcal{H},$$
$$C - C_0 \prod_{\ell \in \mathcal{D}} x_\ell = 0.$$
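To make the system concrete, consider a deliberately simple special case that is not the UNSL itself: a sum of two power laws $y = A N^{-\alpha} + B D^{-\beta}$ with $\mathcal{D} = \{N, D\}$, $\mathcal{H} = \emptyset$, and $C = C_0 N D$. The first equation then gives $\alpha A N^{-\alpha} = \beta B D^{-\beta} = \lambda C$, which can be solved in closed form. The sketch below checks that closed form against a brute-force search; all coefficient values and names are hypothetical.

```python
import numpy as np

# Hypothetical coefficients for y = A * N**-alpha + B * D**-beta.
A, alpha = 300.0, 0.30
B, beta = 500.0, 0.25
C0 = 6.0

def loss(N, D):
    return A * N ** -alpha + B * D ** -beta

def compute_optimal(C):
    """Closed-form solution of the stationarity conditions under C = C0 * N * D."""
    N = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta)) * (C / C0) ** (beta / (alpha + beta))
    D = C / (C0 * N)
    return N, D

C = 1e20
N_opt, D_opt = compute_optimal(C)

# Brute-force comparison: sweep N on a log grid, with D set by the compute constraint.
N_grid = np.logspace(5, 15, 2001)
grid_losses = loss(N_grid, C / (C0 * N_grid))
```

For a general UNSL the partial derivatives have no such closed form, so one would presumably solve the system numerically; the special case is only meant to show what each equation contributes.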
## 13Effect of varying the number of observed points used for fitting UNSL functional form








Figure 5: Varying the number of observed points used for fitting UNSL functional form from $9\mathrm{e}0$ (in top left plot) to $9\mathrm{e}2$ (in bottom right plot). Scaling behavior is that of an MLP trained for a single epoch on the (n, k)-sparse parity task (with $n=40$ and $k=4$) of Barak et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib105)). See Appendix [13](https://arxiv.org/html/2605.26248#S13) for more details.
In Figure [5](https://arxiv.org/html/2605.26248#S13.F5), we observe that UNSL accurately extrapolates scaling behavior when only a small number of observed points are used for fitting UNSL functional form.
## 14UNSL accurately extrapolating to scales an order of magnitude larger in multiple dimensions simultaneously
Figure 6: Extrapolation of UNSL on scaling behavior of an MLP trained for a single epoch on the (n, k)-sparse parity task (with $n=40$ and $k=4$) of Barak et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib105)). Each point in the plot is the mean of greater than 100 seeds. See Section [14](https://arxiv.org/html/2605.26248#S14) and Appendix [11](https://arxiv.org/html/2605.26248#S11) for more details.
In Figure [6](https://arxiv.org/html/2605.26248#S14.F6), UNSL accurately extrapolates to scales an order of magnitude larger in multiple dimensions simultaneously.
## 15Supremal Expressivity Equivalence of A1, A2, A3, and UNSL
In multi-log space, MBNSL (i.e. Equation [4](https://arxiv.org/html/2605.26248#S2.E4) and A1) with $|M| = m$ and $n$ hyperbreaks is
$$\log y = \log b - \left(\sum_{i=1}^{m} c_{i_0} \log x_i\right) - \sum_{j=1}^{n} f_j \cdot \mathrm{softplus}\!\left(\left|\frac{1}{f_j}\right| \left(\sum_{i=1}^{m} c_{i_j} \log x_i - \log d_j\right)\right),$$
which is a single-hidden-layer feedforward network with softplus activation, linear skip connection, and $n$ hidden units. Since the softplus function is continuous and non-polynomial, the universal approximation theorem for non-polynomial activations (Leshno et al., [1993](https://arxiv.org/html/2605.26248#bib.bib152); Cybenko, [1989](https://arxiv.org/html/2605.26248#bib.bib153); Hornik, [1991](https://arxiv.org/html/2605.26248#bib.bib154)) ensures that $\{\text{A1} : n \in \mathbb{N}\}$ is dense in $C(\Omega, \mathbb{R}_{>0})$ for any compact $\Omega \subseteq \mathbb{R}_{>0}^{m}$.
A2, A3, and UNSL generate positive continuous functions of $(x_1, \ldots, x_m)$, which can therefore be arbitrarily well approximated by A1. Conversely, A1 is derived from each by specifying the corresponding parameters: $a_i^{-1} = 0$ for all $i$, ignoring various $K$ components using their parameters, and (for UNSL) $S = 0$. Hence A1, A2, A3, and UNSL have identical supremal expressivity.
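The network view of MBNSL used in this argument can be checked numerically. The sketch below, with hypothetical function names and parameter values, evaluates the product form of the simplified Equation 4 and the softplus form in multi-log space and compares the two; it illustrates the identity and is not the authors' implementation.

```python
import numpy as np

def mbnsl(x, b, c0, c, d, f):
    """Product form of simplified Equation 4. Shapes: x, c0 -> (m,); c -> (n, m); d, f -> (n,)."""
    ratio = np.prod(x ** c, axis=1) / d
    return b * np.prod(x ** -c0) * np.prod((1.0 + ratio ** np.abs(1.0 / f)) ** -f)

def softplus(z):
    return np.logaddexp(0.0, z)  # log(1 + exp(z)), computed stably

def log_mbnsl_network(x, b, c0, c, d, f):
    """Same function as log(mbnsl), written as a one-hidden-layer softplus network."""
    log_x = np.log(x)
    hidden = softplus(np.abs(1.0 / f) * (c @ log_x - np.log(d)))  # n hidden units
    skip = np.log(b) - c0 @ log_x                                  # bias plus linear skip connection
    return skip - f @ hidden

# Hypothetical parameter values chosen only for illustration.
b, c0 = 2.0, np.array([0.3, 0.1])
c = np.array([[0.5, 0.2], [0.1, 0.4]])
d, f = np.array([4.0, 2.0]), np.array([0.7, -0.5])
x = np.array([3.0, 50.0])
```

Each hyperbreak $j$ corresponds to one hidden unit whose input weights are $\left|1/f_j\right| c_{i_j}$ and whose output weight is $-f_j$.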
## 16Explanation of how UNSL functional form satisfies all of the desiderata of Section[2\.2](https://arxiv.org/html/2605.26248#S2.SS2)
### 16\.1Explanation of how UNSL functional form satisfies Desideratum[1](https://arxiv.org/html/2605.26248#S2.I2.i1)
Desideratum [1](https://arxiv.org/html/2605.26248#S2.I2.i1) says that for each single input dimension $x_i$, the scaling behavior follows a univariate broken neural scaling law of Caballero et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib2)), i.e.:
$$y = b \cdot x_i^{-{c_i}_0} \prod_{j=1}^{n} \left(1 + \left(\frac{x_i^{{c_i}_j}}{d_j}\right)^{\left|\frac{1}{f_j}\right|}\right)^{-f_j},$$
where $b$, ${c_i}_0$, ${c_i}_1 \ldots {c_i}_j$, $d_1 \ldots d_j$, and $f_1 \ldots f_j$ are learned parameters. (Note that the “performance limit” term $a$ from Caballero et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib2)) is intentionally removed here because it is addressed by other desiderata.)
This is implemented in Equation[4](https://arxiv.org/html/2605.26248#S2.E4), where each univariate scaling behavior is modeled as a Broken Neural Scaling Law \(BNSL\):
$$K(M, n, k) = b_k \cdot \left(\prod_{i \in M} x_i^{-c_{i_{0_k}}}\right) \prod_{j=1}^{n} \left(1 + \left(\frac{\prod_{i \in M} x_i^{c_{i_{j_k}}}}{d_{j_k}}\right)^{\left|\frac{1}{f_{j_k}}\right|}\right)^{-f_{j_k}}.$$
For the pedagogical purposes of this Section [16.1](https://arxiv.org/html/2605.26248#S16.SS1), by setting $M = \{1, \ldots, m\}$ and removing subscript $k$ one can simplify Equation [4](https://arxiv.org/html/2605.26248#S2.E4) to:
$$y = b \cdot \left(\prod_{i=1}^{m} x_i^{-c_{i_0}}\right) \prod_{j=1}^{n} \left(1 + \left(\frac{\prod_{i=1}^{m} x_i^{c_{i_j}}}{d_j}\right)^{\left|\frac{1}{f_j}\right|}\right)^{-f_j}.$$
In that equation, when one varies only a single input dimension $x_i$, all $x$ with a subscript in “$\{1, \ldots, m\} \setminus \{i\}$” become constants that can be folded into $b$ or $d_j$, hence recovering the univariate broken neural scaling law of Caballero et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib2)), i.e.:
$$y = b \cdot x_i^{-{c_i}_0} \prod_{j=1}^{n} \left(1 + \left(\frac{x_i^{{c_i}_j}}{d_j}\right)^{\left|\frac{1}{f_j}\right|}\right)^{-f_j}.$$
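The folding step described above can be made explicit. In the sketch below (hypothetical names and parameter values, not the authors' code), the fixed dimensions are absorbed as $b' = b \prod_{l \neq i} x_l^{-c_{l_0}}$ and $d_j' = d_j / \prod_{l \neq i} x_l^{c_{l_j}}$, and the resulting univariate law is compared with the multivariate one as $x_i$ varies.

```python
import numpy as np

def mbnsl(x, b, c0, c, d, f):
    """Simplified Equation 4. Shapes: x, c0 -> (m,); c -> (n, m); d, f -> (n,)."""
    ratio = np.prod(x ** c, axis=1) / d
    return b * np.prod(x ** -c0) * np.prod((1.0 + ratio ** np.abs(1.0 / f)) ** -f)

def bnsl(xi, b, c0_i, c_i, d, f):
    """Univariate broken neural scaling law without the performance-limit term."""
    return b * xi ** -c0_i * np.prod((1.0 + (xi ** c_i / d) ** np.abs(1.0 / f)) ** -f)

def fold_constants(i, x, b, c0, c, d, f):
    """Absorb every dimension except i (held at the values in x) into b and d."""
    others = np.arange(len(x)) != i
    b_uni = b * np.prod(x[others] ** -c0[others])
    d_uni = d / np.prod(x[others] ** c[:, others], axis=1)
    return b_uni, c0[i], c[:, i], d_uni, f

# Hypothetical parameter values chosen only for illustration.
b, c0 = 2.0, np.array([0.3, 0.1])
c = np.array([[0.5, 0.2], [0.1, 0.4]])
d, f = np.array([4.0, 2.0]), np.array([0.7, -0.5])
x = np.array([3.0, 50.0])
uni_params = fold_constants(0, x, b, c0, c, d, f)  # vary x_1, hold x_2 = 50
```

Calling `bnsl(xi, *uni_params)` for any positive `xi` should then match `mbnsl` evaluated at `(xi, 50.0)`.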
### 16\.2Explanation of how UNSL functional form satisfies Desideratum[2](https://arxiv.org/html/2605.26248#S2.I2.i2)
In Equation [4](https://arxiv.org/html/2605.26248#S2.E4), the $j$-th hyperbreak (i.e. smooth transition from the $j$-th hyperplane to the $(j+1)$-th hyperplane) occurs at the values of $(x_i)_{i \in M}$ for which this equality is true:
$$d_{j_k} = \prod_{i \in M} x_i^{c_{i_{j_k}}}.$$
As can be seen from this equality, the location at which each hyperbreak occurs is shifted via multiplicative interaction between (exponentiations of) input dimensions $(x_i)_{i \in M}$.
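For two input dimensions, the hyperbreak equality $d = x_1^{c_1} x_2^{c_2}$ can be solved for $x_1$, which shows how increasing one dimension moves the break along the other. The sketch below uses hypothetical names and values and assumes $c_1 \neq 0$.

```python
import numpy as np

def hyperbreak_x1(x2, c1, c2, d):
    """Value of x1 on the hyperbreak d = x1**c1 * x2**c2 for a given x2 (requires c1 != 0)."""
    return (d / x2 ** c2) ** (1.0 / c1)

# Hypothetical exponents and break constant.
c1, c2, d = 0.5, 0.25, 8.0
x2_values = np.array([10.0, 100.0, 1000.0])
x1_at_break = hyperbreak_x1(x2_values, c1, c2, d)
```

With both exponents positive, as in this example, a larger $x_2$ moves the break to a smaller $x_1$; in multi-log space the break is the hyperplane $c_1 \log x_1 + c_2 \log x_2 = \log d$.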
### 16\.3Explanation of how UNSL functional form satisfies Desideratum[3](https://arxiv.org/html/2605.26248#S2.I2.i3)
For the pedagogical purposes of this Section [16.3](https://arxiv.org/html/2605.26248#S16.SS3), by removing subscript $k$ one can simplify Equation [4](https://arxiv.org/html/2605.26248#S2.E4) to:
$$y = b \cdot \left(\prod_{i \in M} x_i^{-c_{i_0}}\right) \prod_{j=1}^{n} \left(1 + \left(\frac{\prod_{i \in M} x_i^{c_{i_j}}}{d_j}\right)^{\left|\frac{1}{f_j}\right|}\right)^{-f_j}.$$
When $c_{i_j}$ and $f_j$ are constrained to force that functional form to always be nonmonotonic (and assuming $x_i > 0$, $y > 0$, $b > 0$, $d_j > 0$), that functional form effectively becomes the following monomial when maximally close to the global minima with respect to $y$:
$$y = \left(b \cdot \prod_{\substack{j=1 \\ f_j > 0}}^{n} d_j\right) \cdot \prod_{i \in M} x_i^{-\left(c_{i_0} + \sum_{\substack{j=1 \\ f_j > 0}}^{n} c_{i_j}\right)}.$$
When Equation [4](https://arxiv.org/html/2605.26248#S2.E4) becomes a monomial, the expressivity of Equation [3](https://arxiv.org/html/2605.26248#S2.E3) becomes equivalent to the expressivity of this functional form:
$$b \cdot \left(\prod_{i \in U_r} x_i^{-c_{i_{0_0}}}\right) + \sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}},$$
which effectively becomes
$$\sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}}$$
when
$$b \prod_{i \in U_r} x_i^{-c_{i_{0_0}}} \ll \sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}}.$$
As a result, it is also true that
$$a + b \cdot \left(\prod_{i \in U_r} x_i^{-c_{i_{0_0}}}\right) + \sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}}$$
effectively becomes
$$a + \sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}}$$
when
$$b \prod_{i \in U_r} x_i^{-c_{i_{0_0}}} \ll \sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}}.$$
Additionally, the following functional form
$$a + \sum_{t \in T_r} b_t\, x_t^{-c_{t_{0_t}}}$$
effectively becomes
$$a + b_v\, x_v^{-c_{v_{0_v}}}, \quad (\text{where } v \in T_r)$$
when
$$\Bigg(a + \sum_{t \in T_r \setminus \{v\}} b_t\, x_t^{-c_{t_{0_t}}}\Bigg) \ll \Bigg(a + b_v\, x_v^{-c_{v_{0_v}}}\Bigg).$$
### 16\.4Explanation of how UNSL functional form satisfies Desideratum[4](https://arxiv.org/html/2605.26248#S2.I2.i4)
Recall that Equation [1](https://arxiv.org/html/2605.26248#S2.E1) is:
$$y = a_0 + \left(\left(Q(3) + \left(Q(S+4) + a_1^{-1}\right)^{-1}\right)^{-1} + a_2^{-1}\right)^{-1}.$$
Desideratum [4](https://arxiv.org/html/2605.26248#S2.I2.i4) is captured by the addition between $a_0$ and $\left(\left(Q(3) + \left(Q(S+4) + a_1^{-1}\right)^{-1}\right)^{-1} + a_2^{-1}\right)^{-1}$ in this equation, where $a_0$ represents the ultimate limit of performance.
### 16\.5Explanation of how UNSL functional form satisfies Desideratum[5](https://arxiv.org/html/2605.26248#S2.I2.i5)
This is captured in Equations [1](https://arxiv.org/html/2605.26248#S2.E1) and [2](https://arxiv.org/html/2605.26248#S2.E2) when the reciprocal of every variable in the set $\{a_i\}_{i>0}$ is summed with a functional form and each resultant sum is then raised to the $-1$ power. The set $\{a_i\}_{i>0}$ contains multiple variables rather than a single variable because misperformance caused by different phenomena often has different misperformance limits. For example, misperformance caused by overfitting often has a misperformance limit that is significantly worse than the performance of random guessing; meanwhile, misperformance caused by nonoptimal hyperparameters often has at least one misperformance limit that is equal to the performance of random guessing. The reason that a value of $S$ greater than $1$ (rather than equal to $1$) is sometimes used in Equation [2](https://arxiv.org/html/2605.26248#S2.E2) is that there sometimes are multiple misperformance limits $a_{q+s}$ (e.g. as in Figure [9](https://arxiv.org/html/2605.26248#S17.F9) of Appendix [17.5](https://arxiv.org/html/2605.26248#S17.SS5)): a misperformance limit that is significantly larger than random guessing (usually noticeable when the number of training steps is small) and a misperformance limit that is approximately less than or equal to random guessing (usually noticeable when the number of training steps is large).
### 16\.6Explanation of how UNSL functional form satisfies Desideratum[6](https://arxiv.org/html/2605.26248#S2.I2.i6)
Recall Appendix [16.3](https://arxiv.org/html/2605.26248#S16.SS3). As a result of Appendix [16.3](https://arxiv.org/html/2605.26248#S16.SS3), Desideratum [6](https://arxiv.org/html/2605.26248#S2.I2.i6) is captured by every instance in which $R(r)$ of Equation [3](https://arxiv.org/html/2605.26248#S2.E3) is effectively raised to the $-1$ power; an instance of $R(r)$ is considered “effectively raised to the $-1$ power” if the count of reciprocal operations whose scope contains that instance is odd. Instances in which this occurs are $\left(R(q+s) + a_{q+s}^{-1}\right)^{-1}$ from Equation [2](https://arxiv.org/html/2605.26248#S2.E2) and $\left(Q(S+4) + a_1^{-1}\right)^{-1}$ from Equation [1](https://arxiv.org/html/2605.26248#S2.E1) (which contains $\left(\left(R(q)\right)^{-1} + a_q^{-1}\right)^{-1}$ from Equation [2](https://arxiv.org/html/2605.26248#S2.E2)).
### 16\.7Explanation of how UNSL functional form satisfies Desideratum[7](https://arxiv.org/html/2605.26248#S2.I2.i7)
This desideratum is captured when $\left(\left(R(q)\right)^{-1} + a_q^{-1}\right)^{-1}$ is summed with the “oppositional force of hyperparameters” in Equation [2](https://arxiv.org/html/2605.26248#S2.E2), and when $Q(3)$ is summed with the “oppositional force of overfitting” in Equation [1](https://arxiv.org/html/2605.26248#S2.E1).
### 16\.8Explanation of how UNSL functional form satisfies Desideratum[8](https://arxiv.org/html/2605.26248#S2.I2.i8)
The UNSL functional form (i.e. Equation [1](https://arxiv.org/html/2605.26248#S2.E1)), expanded out for pedagogical purposes, is:

$$
\begin{aligned}
y = {} & a_0 + \Bigg( \bigg( \left(\left(R(3)\right)^{-1} + a_3^{-1}\right)^{-1} + \underbrace{\sum_{s=1}^{S} \left(R(3+s) + a_{3+s}^{-1}\right)^{-1}}_{\text{oppositional force of hyperparameters}} \\
& + \underbrace{\left(a_1^{-1} + \left(\left(R(S+4)\right)^{-1} + a_{S+4}^{-1}\right)^{-1} + \underbrace{\sum_{s=1}^{S} \left(R(S+4+s) + a_{S+4+s}^{-1}\right)^{-1}}_{\text{oppositional force of hyperparameters}}\right)^{-1}}_{\text{oppositional force of overfitting}} \bigg)^{-1} + a_2^{-1} \Bigg)^{-1}.
\end{aligned}
$$
As can be seen in that expansion of UNSL, the oppositional force(s) of hyperparameters oppose both the “oppositional force of overfitting” and the subset of the UNSL functional form that is not the “oppositional force of overfitting”. Note that every “oppositional force” is nonnegative and that what every “oppositional force” opposes is nonnegative.
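The nesting can be checked numerically. The sketch below is a minimal illustration, assuming that each $R(\cdot)$ has already been evaluated to a positive number and that $Q(q)$ has the shape read off from the expansion above. The dictionary-based layout and the function names are ours, not the authors' implementation.

```python
def Q(q, R, a, S):
    """((R(q))^-1 + a_q^-1)^-1 + sum_{s=1..S} (R(q+s) + a_{q+s}^-1)^-1."""
    head = 1.0 / (1.0 / R[q] + 1.0 / a[q])
    hyper = sum(1.0 / (R[q + s] + 1.0 / a[q + s]) for s in range(1, S + 1))
    return head + hyper


def unsl(R, a, S):
    """Compact form: a0 + ((Q(3) + (Q(S+4) + a1^-1)^-1)^-1 + a2^-1)^-1."""
    overfit = 1.0 / (Q(S + 4, R, a, S) + 1.0 / a[1])
    return a[0] + 1.0 / (1.0 / (Q(3, R, a, S) + overfit) + 1.0 / a[2])


def unsl_expanded(R, a, S):
    """The same quantity written term by term, following the expansion."""
    main = 1.0 / (1.0 / R[3] + 1.0 / a[3])
    hyper_main = sum(1.0 / (R[3 + s] + 1.0 / a[3 + s]) for s in range(1, S + 1))
    hyper_over = sum(
        1.0 / (R[S + 4 + s] + 1.0 / a[S + 4 + s]) for s in range(1, S + 1)
    )
    overfit = 1.0 / (
        1.0 / a[1] + 1.0 / (1.0 / R[S + 4] + 1.0 / a[S + 4]) + hyper_over
    )
    return a[0] + 1.0 / (1.0 / (main + hyper_main + overfit) + 1.0 / a[2])


# Made-up positive values, for illustration only (S = 1 needs indices 3..6).
R_values = {i: 0.5 * i for i in range(3, 7)}
a_values = {i: 1.0 + i for i in range(7)}
gap = abs(unsl(R_values, a_values, 1) - unsl_expanded(R_values, a_values, 1))
```

Every term being combined is positive when the inputs are positive, which matches the remark that each oppositional force and what it opposes are nonnegative.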
## 17Empirical Evidence of Desiderata of Section[2\.2](https://arxiv.org/html/2605.26248#S2.SS2)
### 17\.1Empirical Evidence of Desideratum[1](https://arxiv.org/html/2605.26248#S2.I2.i1)


Figure 7: Extrapolation Results on scaling behavior of an MLP trained for a single epoch on the $(n, k)$-sparse parity task (with $n=40$ and $k=4$) of Barak et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib105)). Left figure fits the functional form $y = \left(\left(\left(b \prod_{i=1}^{m} x_i^{-c_{i_0}}\right)\left(1 + \left(\frac{\prod_{i=1}^{m} x_i^{c_{i_1}}}{d}\right)^{\left|\frac{1}{f}\right|}\right)^{-f}\right)^{-1} + a^{-1}\right)^{-1}$. Right figure fits the functional form of the left figure when $f$ is constrained to be $1$, such that the functional form of the right figure is $y = \left(\left(\left(b \prod_{i=1}^{m} x_i^{-c_{i_0}}\right)\left(1 + \left(\frac{\prod_{i=1}^{m} x_i^{c_{i_1}}}{d}\right)^{\left|\frac{1}{1}\right|}\right)^{-1}\right)^{-1} + a^{-1}\right)^{-1}$. Observe that the fits and extrapolations in the top right quadrant of the right figure are unsatisfactory. See Section [17.1](https://arxiv.org/html/2605.26248#S17.SS1) for more details.

In Figure [7](https://arxiv.org/html/2605.26248#S17.F7), Desideratum [1](https://arxiv.org/html/2605.26248#S2.I2.i1) is true empirically. As can be seen in Figure [7](https://arxiv.org/html/2605.26248#S17.F7), the sharpness needs to be decoupled from the amount of change in gradient (i.e. via the extra expressivity granted by $f$ in Equation [6](https://arxiv.org/html/2605.26248#S2.E6) (and Equation [4](https://arxiv.org/html/2605.26248#S2.E4))) in order to accurately fit and accurately extrapolate the scaling behavior.
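To make the role of $f$ concrete, here is a minimal sketch of the functional form fitted in Figure 7. The function name and the list-based argument layout are our own choices, and the parameter values in the usage lines are made up for illustration rather than fitted.

```python
import math


def smooth_break_form(xs, a, b, c0, c1, d, f):
    """y = (((b * prod x_i^-c0_i) * (1 + (prod x_i^c1_i / d)^|1/f|)^-f)^-1 + 1/a)^-1.

    xs, c0 and c1 are equal-length sequences; all x_i, a, b and d are positive.
    """
    power_law = b * math.prod(x ** -c for x, c in zip(xs, c0))
    ratio = math.prod(x ** c for x, c in zip(xs, c1)) / d
    broken = power_law * (1.0 + ratio ** abs(1.0 / f)) ** (-f)
    return 1.0 / (1.0 / broken + 1.0 / a)


# Same parameters, two sharpness settings; f = 1 is the constrained variant.
y_constrained = smooth_break_form([4.0], 1e9, 2.0, [0.5], [1.0], 4.0, 1.0)
y_sharper = smooth_break_form([4.0], 1e9, 2.0, [0.5], [1.0], 4.0, 0.5)
```

For $f > 0$ and a ratio far above $1$, the bracketed factor behaves like the ratio raised to the $-1$ power whatever the value of $f$, so $f$ changes only how abruptly the transition happens. Fixing $f = 1$ removes that freedom.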
### 17\.2Empirical Evidence of Desideratum[2](https://arxiv.org/html/2605.26248#S2.I2.i2)
Recall that in Equation [4](https://arxiv.org/html/2605.26248#S2.E4) (with subscript $k$ removed for pedagogical purposes) the $j$-th hyperbreak (i.e. smooth transition from the $j$-th hyperplane to the $(j+1)$-th hyperplane) occurs at the values of $(x_i)_{i \in M}$ for which this equality is true:

$$d_j = \prod_{i \in M} x_i^{c_{i_j}}.$$

As a result, Desideratum [2](https://arxiv.org/html/2605.26248#S2.I2.i2) is true empirically because the functional form $y = \left(\left(\left(b \prod_{i=1}^{m} x_i^{-c_{i_0}}\right)\left(1 + \left(\frac{\prod_{i=1}^{m} x_i^{c_{i_1}}}{d}\right)^{\left|\frac{1}{f}\right|}\right)^{-f}\right)^{-1} + a^{-1}\right)^{-1}$ accurately fits and accurately extrapolates the scaling behavior in Figure [7](https://arxiv.org/html/2605.26248#S17.F7) left.
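The hyperbreak condition can be solved for one input once the others are fixed. The sketch below does this for the first dimension, assuming $M = \{1, \ldots, m\}$ and a nonzero exponent on $x_1$. The helper name and argument layout are hypothetical.

```python
import math


def break_location_x1(d_j, c_j, other_xs):
    """Solve d_j = x_1^c_j[0] * prod_{i>=2} x_i^c_j[i-1] for x_1.

    c_j holds the exponents of the j-th break for x_1..x_m, and other_xs
    holds the fixed positive values of x_2..x_m. Requires c_j[0] != 0.
    """
    rest = math.prod(x ** c for x, c in zip(other_xs, c_j[1:]))
    return (d_j / rest) ** (1.0 / c_j[0])


# Illustrative numbers only: 64 = x_1^2 * 4^1 gives x_1 = 4.
x1_at_break = break_location_x1(64.0, [2.0, 1.0], [4.0])
```

Sweeping `other_xs` traces out where the break sits in the remaining dimensions, which is why the break is a surface in the input space and not a single point.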
### 17\.3Empirical Evidence of Desideratum[3](https://arxiv.org/html/2605.26248#S2.I2.i3)
Note that $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ and that $(x_t)_{t \in T_r} \in \overline{\mathbb{R}}_{>0}^{\,T_r}$.
Desideratum [3](https://arxiv.org/html/2605.26248#S2.I2.i3) is observed empirically in several prior works such as Hoffmann et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib66)), which empirically show that the scaling behavior follows $y = a + \sum_{t \in T_r} b_t \cdot x_t^{-c_t}$ when sufficiently close to the global optimum of $y$.
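A minimal sketch of this additive power-law form follows. The function name is ours and the coefficients in the usage line are invented round numbers, not fitted values from any paper.

```python
def additive_power_law(xs, a, bs, cs):
    """y = a + sum_t b_t * x_t^(-c_t) over equal-length sequences."""
    return a + sum(b * x ** -c for x, b, c in zip(xs, bs, cs))


# Two input dimensions with illustrative coefficients:
# 1 + 2 * 4^-0.5 + 3 * 9^-0.5 = 1 + 1 + 1.
y_example = additive_power_law([4.0, 9.0], 1.0, [2.0, 3.0], [0.5, 0.5])
```

With positive $b_t$ and $c_t$, each term shrinks independently as its own $x_t$ grows, and $y$ approaches the constant $a$.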
### 17\.4Empirical Evidence of Desideratum[4](https://arxiv.org/html/2605.26248#S2.I2.i4)
Desideratum [4](https://arxiv.org/html/2605.26248#S2.I2.i4) is observed empirically in several prior works such as Hoffmann et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib66)), in which the transition to the performance limit, as all $x_i$ dimensions in $x_1 \ldots x_m$ simultaneously go to the values of $(x_i)_{i=1}^{m} \in \overline{\mathbb{R}}_{>0}^{\,m}$ that yield the global optimum of $y$, is characterized by summing an entire functional form with a constant.
### 17\.5Empirical Evidence of Desideratum[5](https://arxiv.org/html/2605.26248#S2.I2.i5)
Figure 8: Extrapolation Results of functional form $y = \left(\left(b \prod_{i=1}^{m} x_i^{-c_i}\right)^{-1} + a^{-1}\right)^{-1}$. Scaling behavior is (top left region of) that of an MLP trained for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). See Section [17.5](https://arxiv.org/html/2605.26248#S17.SS5) for more details.

In Figure [8](https://arxiv.org/html/2605.26248#S17.F8), Desideratum [5](https://arxiv.org/html/2605.26248#S2.I2.i5) is true empirically. As can be seen in Figure [8](https://arxiv.org/html/2605.26248#S17.F8), the scaling behavior is characterized by raising to the $-1$ power the sum of a functional form and a constant.
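The saturating behavior of the Figure 8 functional form can be sketched as follows. The function name is ours, and the limit used in the usage lines, $\ln 10$ (the cross-entropy of uniform guessing over ten classes), is chosen only for illustration and is not a fitted value.

```python
import math


def saturating_power_law(xs, a, b, cs):
    """y = ((b * prod x_i^-c_i)^-1 + a^-1)^-1 for positive inputs."""
    power_law = b * math.prod(x ** -c for x, c in zip(xs, cs))
    return 1.0 / (1.0 / power_law + 1.0 / a)


limit = math.log(10)  # illustrative misperformance limit, an assumption
y_small_x = saturating_power_law([1e-12], limit, 1.0, [1.0])  # pinned near limit
y_large_x = saturating_power_law([1e6], limit, 1.0, [1.0])    # follows power law
```

When the bare power law would exceed $a$ by a wide margin, the output stays just below $a$. When the power law is far below $a$, the output tracks the power law almost exactly.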
Figure 9: Extrapolation Results of UNSL functional form. Scaling behavior is that of an MLP (when standard deviation of weights at initialization is large) trained for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). See Section [17.5](https://arxiv.org/html/2605.26248#S17.SS5) for more details.

In Figure [9](https://arxiv.org/html/2605.26248#S17.F9), with regard to Desideratum [5](https://arxiv.org/html/2605.26248#S2.I2.i5), there is a misperformance limit approximately equal to random guessing performance (cross-entropy of 2.3) when the learning rate is large (i.e. greater than 3) and the number of training steps is large (i.e. greater than 100). An additional misperformance limit, equal to a value significantly larger than random guessing performance (i.e. larger than the largest $y$-axis value of Figure [9](https://arxiv.org/html/2605.26248#S17.F9)), occurs when the learning rate is large (i.e. significantly greater than 20) and the number of training steps is small (i.e. less than 2). As a result, $S$ (from Equation [2](https://arxiv.org/html/2605.26248#S2.E2)) is equal to $2$ in Figure [9](https://arxiv.org/html/2605.26248#S17.F9).
### 17\.6Empirical Evidence of Desideratum[6](https://arxiv.org/html/2605.26248#S2.I2.i6)
Figure 10: Extrapolation Results of functional form $y = \left(a + \sum_{t \in T_r} b_t \cdot x_t^{-c_t}\right)^{-1}$. Scaling behavior is (top region of the scaling behavior of) downstream ImageNet test error rate of ViT pre-trained on JFT. See Sections [17.6](https://arxiv.org/html/2605.26248#S17.SS6) and [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

In Figure [10](https://arxiv.org/html/2605.26248#S17.F10), Desideratum [6](https://arxiv.org/html/2605.26248#S2.I2.i6) is true empirically. As can be seen in Figure [10](https://arxiv.org/html/2605.26248#S17.F10), the scaling behavior is characterized by the functional form $y = \left(a + \sum_{t \in T_r} b_t \cdot x_t^{-c_t}\right)^{-1}$.
### 17\.7Empirical Evidence of Desideratum[7](https://arxiv.org/html/2605.26248#S2.I2.i7)
In Table [6](https://arxiv.org/html/2605.26248#S17.T6), which summarizes Figures [11](https://arxiv.org/html/2605.26248#S17.F11) and [12](https://arxiv.org/html/2605.26248#S17.F12), Desideratum [7](https://arxiv.org/html/2605.26248#S2.I2.i7) is true empirically. We obtain the trivariate scaling behavior as learning rate, standard deviation of weights at initialization, and number of training steps vary when training an MLP for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). When holding the number of learned parameters of the functional forms constant, we compare the training and extrapolation RMSLE of UNSL to the following ablated functional form baseline in which the additive symmetries of Equation [2](https://arxiv.org/html/2605.26248#S2.E2) are removed:

$$y = a_0 + K(U_0,\, n_{0_0},\, 0) + \sum_{t \in T_0} K(\{t\},\, n_{0_t},\, t), \quad \text{where } U_0, T_0 \subseteq \{1, \ldots, m\}. \tag{8}$$

As can be seen in Table [6](https://arxiv.org/html/2605.26248#S17.T6) and Figures [11](https://arxiv.org/html/2605.26248#S17.F11) and [12](https://arxiv.org/html/2605.26248#S17.F12), when holding the number of learned parameters of the functional forms constant, UNSL yields fits and extrapolations with lower RMSLE than the ablated functional form baseline of Equation [8](https://arxiv.org/html/2605.26248#S17.E8).
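For readers unfamiliar with the metric, the sketch below computes a root mean squared log error. It assumes the convention in which the error is the difference of natural logarithms of prediction and target; this section does not spell out the exact definition used, and a common alternative applies $\log(1 + \cdot)$ instead.

```python
import math


def rmsle(y_pred, y_true):
    """Root mean squared log error, assuming the log(pred) - log(true) convention.

    Both sequences must be the same length and contain positive values.
    """
    squared = [(math.log(p) - math.log(t)) ** 2 for p, t in zip(y_pred, y_true)]
    return math.sqrt(sum(squared) / len(squared))


# One exact prediction and one that is off by a factor of e.
score = rmsle([1.0, math.e], [1.0, 1.0])
```

Because the error is measured on a log scale, being off by a given multiplicative factor costs the same whether the target is large or small.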
Table 6: Results on trivariate scaling behavior in which Desideratum [7](https://arxiv.org/html/2605.26248#S2.I2.i7) is true empirically. See Section [17.7](https://arxiv.org/html/2605.26248#S17.SS7) for more details.


Figure 11: Extrapolation Results of UNSL. This trivariate scaling behavior is that of an MLP trained for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). See Section [17.7](https://arxiv.org/html/2605.26248#S17.SS7) for more details.


Figure 12: Extrapolation Results of ablation baseline of Equation [8](https://arxiv.org/html/2605.26248#S17.E8). This trivariate scaling behavior is that of an MLP trained for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). See Section [17.7](https://arxiv.org/html/2605.26248#S17.SS7) for more details.
### 17\.8Empirical Evidence of Desideratum[8](https://arxiv.org/html/2605.26248#S2.I2.i8)
In Table [3](https://arxiv.org/html/2605.26248#S4.T3), Desideratum [8](https://arxiv.org/html/2605.26248#S2.I2.i8) is true empirically because the UNSL functional form outperforms A3 in the majority of instances.
Recall that the A3 functional form is:

$$y = a_0 + \left(\left(\left(\left(R(0)\right)^{-1} + a_1^{-1}\right)^{-1} + \underbrace{\sum_{s=1}^{S} \left(R(s) + a_{s+2}^{-1}\right)^{-1}}_{\text{all oppositional forces in general}}\right)^{-1} + a_2^{-1}\right)^{-1}.$$

Meanwhile, the UNSL functional form (expanded out for pedagogical purposes) is:

$$
\begin{aligned}
y = {} & a_0 + \Bigg( \bigg( \left(\left(R(3)\right)^{-1} + a_3^{-1}\right)^{-1} + \underbrace{\sum_{s=1}^{S} \left(R(3+s) + a_{3+s}^{-1}\right)^{-1}}_{\text{oppositional force of hyperparameters}} \\
& + \underbrace{\left(a_1^{-1} + \left(\left(R(S+4)\right)^{-1} + a_{S+4}^{-1}\right)^{-1} + \underbrace{\sum_{s=1}^{S} \left(R(S+4+s) + a_{S+4+s}^{-1}\right)^{-1}}_{\text{oppositional force of hyperparameters}}\right)^{-1}}_{\text{oppositional force of overfitting}} \bigg)^{-1} + a_2^{-1} \Bigg)^{-1}.
\end{aligned}
$$
As can be seen in that expansion of UNSL, the oppositional force(s) of hyperparameters oppose both the “oppositional force of overfitting” and the subset of the UNSL functional form that is not the “oppositional force of overfitting”; meanwhile, in the A3 functional form, the “oppositional force(s) of hyperparameters” do not oppose the “oppositional force of overfitting”. Note that every “oppositional force” is nonnegative and that what every “oppositional force” opposes is nonnegative.
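For comparison, the A3 functional form can be written as a few lines of arithmetic. This is a minimal sketch that assumes each $R(\cdot)$ has already been evaluated to a positive number; the dictionary layout and the function name are ours.

```python
def a3_form(R, a, S):
    """A3: a0 + ((((R(0))^-1 + a1^-1)^-1 + sum_{s=1..S} (R(s) + a_{s+2}^-1)^-1)^-1 + a2^-1)^-1."""
    main = 1.0 / (1.0 / R[0] + 1.0 / a[1])
    # A single pooled sum: no separate overfitting term sits inside it.
    forces = sum(1.0 / (R[s] + 1.0 / a[s + 2]) for s in range(1, S + 1))
    return a[0] + 1.0 / (1.0 / (main + forces) + 1.0 / a[2])


# Made-up values for illustration (S = 1 needs R[0], R[1] and a[0..3]).
y_a3 = a3_form({0: 1.0, 1: 1.0}, {0: 0.5, 1: 1.0, 2: 1.0, 3: 1.0}, 1)
```

The structural difference from UNSL is visible in the code: A3 has one pooled sum of oppositional forces, whereas UNSL nests a second hyperparameter sum inside the overfitting term.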
## 18Plots of Extrapolation Results
### 18\.1Plots of Reinforcement Learning Extrapolation Results
Figure 13: Extrapolation Results of UNSL on scaling behavior of reinforcement learning. Experimental data obtained from Figure 1a of Hilton et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib78)). Scaling behavior is that of “StarPilot (hard)” task of Cobbe et al. ([2020](https://arxiv.org/html/2605.26248#bib.bib63)). X-axis is number of frames processed (i.e. “batch size” multiplied by “number of training steps”) during training. See Section [18.1](https://arxiv.org/html/2605.26248#S18.SS1) for more details.

In Figure [13](https://arxiv.org/html/2605.26248#S18.F13), UNSL accurately extrapolates the multivariate scaling behavior of reinforcement learning.
### 18\.2Plots of Inference Scaling Extrapolation Results
Figure 14: Extrapolation Results of UNSL on scaling behavior of inference scaling. Experimental data obtained from Figure 4a of Sadhukhan et al. ([2025](https://arxiv.org/html/2605.26248#bib.bib117)). Scaling behavior is that of test error rate on Mathematical Association of America ([2024](https://arxiv.org/html/2605.26248#bib.bib118)). X-axis is “Chain-of-Thought Length” (measured in number of tokens), i.e. how many “thinking” tokens a model outputs before outputting a final answer. See Section [18.2](https://arxiv.org/html/2605.26248#S18.SS2) for more details.

In Figure [14](https://arxiv.org/html/2605.26248#S18.F14), UNSL accurately extrapolates the multivariate scaling behavior of inference (i.e. test-time) scaling.
### 18\.3Plots of “Width vs Depth” Extrapolation Results
Figure 15: Extrapolation Results of UNSL on multivariate scaling behavior as width and depth simultaneously vary. Scaling behavior is that of an MLP trained for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). See Section [18.3](https://arxiv.org/html/2605.26248#S18.SS3) for more details.

In Figure [15](https://arxiv.org/html/2605.26248#S18.F15), UNSL accurately extrapolates multivariate scaling behavior as width and depth vary simultaneously.
### 18\.4Plots of Multivariate “Batch Size” Extrapolation Results
Figure 16: Extrapolation Results of UNSL on multivariate scaling behavior as batch size and number of training steps simultaneously vary. Scaling behavior is that of an MLP trained for a single epoch on dataset of Greydanus and Kobak ([2024](https://arxiv.org/html/2605.26248#bib.bib104)). See Section [18.4](https://arxiv.org/html/2605.26248#S18.SS4) for more details.

In Figure [16](https://arxiv.org/html/2605.26248#S18.F16), UNSL accurately extrapolates multivariate scaling behavior when batch size is an input dimension to UNSL.
### 18\.5Plots of Vision Extrapolation Results
#### 18\.5\.1Bivariate


Figure 17: Extrapolation Results of UNSL on bivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 18: Extrapolation Results of “DC” functional form of Muennighoff et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib10)) on bivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 19: Extrapolation Results of A1 functional form on bivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 20: Extrapolation Results of A2 functional form on bivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 21: Extrapolation Results of A3 functional form on bivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.
#### 18\.5\.2Trivariate


Figure 22: Extrapolation Results of UNSL functional form on trivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 23: Extrapolation Results of “DC” functional form of Muennighoff et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib10)) on trivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 24: Extrapolation Results of A1 functional form on trivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 25: Extrapolation Results of A2 functional form on trivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.

Figure 26: Extrapolation Results of A3 functional form on trivariate scaling behavior of downstream vision performance. See Section [4.1](https://arxiv.org/html/2605.26248#S4.SS1) for more details.
### 18\.6Plots of Language Extrapolation Results
#### 18\.6\.1Trivariate
Figure 27: Extrapolation Results of UNSL on trivariate scaling behavior of language performance. All 20 plots are slices of a single functional form fit to a single trivariate scaling behavior. The title of each plot represents the number of model parameters, the x-axis of each plot represents the number of training steps times the batch size, and the color bar of each plot represents the training dataset size. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 28: Extrapolation Results of “DC” functional form of Muennighoff et al. ([2023](https://arxiv.org/html/2605.26248#bib.bib10)) on trivariate scaling behavior of language performance. All 20 plots are slices of a single functional form fit to a single trivariate scaling behavior. The title of each plot represents the number of model parameters, the x-axis of each plot represents the number of training steps times the batch size, and the color bar of each plot represents the training dataset size. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 29: Extrapolation Results of A1 functional form on trivariate scaling behavior of language performance. All 20 plots are slices of a single functional form fit to a single trivariate scaling behavior. The title of each plot represents the number of model parameters, the x-axis of each plot represents the number of training steps times the batch size, and the color bar of each plot represents the training dataset size. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 30: Extrapolation Results of A2 functional form on trivariate scaling behavior of language performance. All 20 plots are slices of a single functional form fit to a single trivariate scaling behavior. The title of each plot represents the number of model parameters, the x-axis of each plot represents the number of training steps times the batch size, and the color bar of each plot represents the training dataset size. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 31: Extrapolation Results of A3 functional form on trivariate scaling behavior of language performance. All 20 plots are slices of a single functional form fit to a single trivariate scaling behavior. The title of each plot represents the number of model parameters, the x-axis of each plot represents the number of training steps times the batch size, and the color bar of each plot represents the training dataset size. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.
#### 18\.6\.2Bivariate


Figure 32: Extrapolation Results of UNSL on bivariate scaling behavior of downstream (and upstream) language performance. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 33: Extrapolation Results of “CF” functional form of Hoffmann et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib66)) on bivariate scaling behavior of downstream (and upstream) language performance. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 34: Extrapolation Results of A1 functional form on bivariate scaling behavior of downstream (and upstream) language performance. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.

Figure 35: Extrapolation Results of A2 functional form on bivariate scaling behavior of downstream (and upstream) language performance. See Section [4.2](https://arxiv.org/html/2605.26248#S4.SS2) for more details.
## 19Additional Related Work
There has been additional work on scaling law settings, interventions, and extensions besides those emphasized in this paper. Scaling laws have been applied in the context of autoregressive generative modeling along various axes of scale, e.g., compute/model/dataset (Henighan et al., [2020](https://arxiv.org/html/2605.26248#bib.bib119)), as well as in the regime of transfer learning/fine-tuning, i.e., the joint effect of the scale of the pre-training task and the quantity of data available for downstream fine-tuning (Hernandez et al., [2021](https://arxiv.org/html/2605.26248#bib.bib120)). In addition, the influence of dataset curation/selection on scaling laws is receiving increasing attention, e.g., via pruning or data selection with the goal of optimizing scaling exponents (Sorscher et al., [2022](https://arxiv.org/html/2605.26248#bib.bib121); Ayed and Hayou, [2023](https://arxiv.org/html/2605.26248#bib.bib122)). Furthermore, the problem of compute-optimal scaling, i.e., the design of scaling laws beyond the number of parameters, e.g., with respect to depth, width, and shape, has been addressed (Alabdulmohsin et al., [2023](https://arxiv.org/html/2605.26248#bib.bib123)). In parallel with the problem of predictive scaling-law fitting, a different line of work has focused on the design of explicit architectural scaling heuristics, e.g., compound scaling (Tan and Le, [2019](https://arxiv.org/html/2605.26248#bib.bib124)). Other learning contexts and classes of models where scaling analyses have also been pursued include reinforcement learning (Neumann and Gros, [2023](https://arxiv.org/html/2605.26248#bib.bib125)) and diffusion(-transformer) generative models (Liang et al., [2024](https://arxiv.org/html/2605.26248#bib.bib126); Li et al., [2024](https://arxiv.org/html/2605.26248#bib.bib127)).
A related body of work has focused on the study of non-monotonic generalization properties and multi-regime phenomena with respect to scaling. Much of this work has focused on double descent phenomena that depend on model capacity, sample size, and/or training time, as seen in Belkin et al. ([2019](https://arxiv.org/html/2605.26248#bib.bib128)); Nakkiran et al. ([2019](https://arxiv.org/html/2605.26248#bib.bib129)); Hastie et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib130)); Adlam and Pennington ([2020](https://arxiv.org/html/2605.26248#bib.bib131)). Other related statistical physics viewpoints on non-monotonicity and sharp transitions in terms of phase transitions/jamming are presented in Spigler et al. ([2019](https://arxiv.org/html/2605.26248#bib.bib132)). Another noteworthy example of non-monotonic behavior depending on optimization and regularization is “Grokking,” or sudden late generalization far after the start of memorization, as seen in Power et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib134)); Liu et al. ([2022](https://arxiv.org/html/2605.26248#bib.bib135); [2023](https://arxiv.org/html/2605.26248#bib.bib136)).
The application of scaling laws in forecasting is closely related to the problem of learning curve extrapolation, multi-fidelity hyperparameter optimization, and the scaling of hyperparameters. Because scaling laws are used operationally for forecasting on the basis of partial training runs, this problem is related to learning curve extrapolation and early termination of hyperparameter search, as discussed in (Domhan et al., [2015](https://arxiv.org/html/2605.26248#bib.bib137); Swersky et al., [2014](https://arxiv.org/html/2605.26248#bib.bib139)). It is also related to multi-fidelity/bandit and BO-based methods for HPO, as discussed in (Li et al., [2018](https://arxiv.org/html/2605.26248#bib.bib140); Falkner et al., [2018](https://arxiv.org/html/2605.26248#bib.bib141); Snoek et al., [2012](https://arxiv.org/html/2605.26248#bib.bib142); Hutter et al., [2011](https://arxiv.org/html/2605.26248#bib.bib144)). In the area of large batch training, there are studies on scaling rules for schedules and optimizers, which analyze how optimal hyperparameters (such as the learning rate, batch size, and momentum) change with scale and batch size, as discussed in (Goyal et al., [2017](https://arxiv.org/html/2605.26248#bib.bib145); Smith et al., [2017](https://arxiv.org/html/2605.26248#bib.bib146); You et al., [2017](https://arxiv.org/html/2605.26248#bib.bib147); McCandlish et al., [2018](https://arxiv.org/html/2605.26248#bib.bib148)). Another line of work on theory-driven parameterization of schedules aims at enabling zero-shot transfer of hyperparameters across network widths (such as in $\mu$P), as discussed in (Yang et al., [2022](https://arxiv.org/html/2605.26248#bib.bib149)).
Finally, the study of “emergent abilities” points out that the observation of clear transitions with respect to scale may actually indicate either real regime changes or the effects of nonlinear/discontinuous metrics and estimation at finite samples (Wei et al., [2022](https://arxiv.org/html/2605.26248#bib.bib150); Schaeffer et al., [2023](https://arxiv.org/html/2605.26248#bib.bib151)).
## 20Implementation of UNSL
We recommend using a system with at least a dozen CPUs if one wants the code to find global optima quickly\.
Some notes about the code:
- The training dataset is always passed in as one of the inputs to the functional form (via the batch_train argument), even when evaluating on the test dataset. This is because the inputs and outputs of the functional form are z-normalized using the mean and std of the training dataset in log space. This is analogous to how batch normalization requires always using the mean and std of the training dataset (even when evaluating on the test dataset). This z-normalization causes the functional form to converge to the global optimum faster and using fewer seeds. This z-normalization has no effect on the expressivity of the functional form.
- The parallel “executor.submit” part will get stuck (and stop making progress) if any jax arrays are outside of the optimize_model function.
- If you get an error related to parallelization, decrease the “n_s” variable or use a system with more CPUs.
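The log-space z-normalization described in the first note can be sketched in plain Python as follows. This is an illustration of the idea only and is not taken from the authors' code: the helper names are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def fit_log_znorm(train_values):
    """Return the mean and std of log(train_values); fitted on training data only."""
    logs = [math.log(v) for v in train_values]
    return statistics.fmean(logs), statistics.pstdev(logs)


def log_znorm(values, mean, std):
    """Normalize any split (train or test) with the training statistics."""
    return [(math.log(v) - mean) / std for v in values]


def inverse_log_znorm(z_values, mean, std):
    """Undo the normalization; the map is invertible, so no information is lost."""
    return [math.exp(z * std + mean) for z in z_values]


train_mean, train_std = fit_log_znorm([1.0, math.e ** 2])
test_z = log_znorm([3.0, 5.0], train_mean, train_std)
test_back = inverse_log_znorm(test_z, train_mean, train_std)
```

Because the transform is an invertible affine map in log space, it rescales the optimization problem without changing which curves can be represented, consistent with the note that expressivity is unaffected.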
To run this code:
1. Copy this code and replace each ␣ (that appears when pasting) with a space.
2. Run this sequence of ipython cells in order.
`In \[1\]: In \[2\]: In \[3\]: In \[4\]: In \[5\]: In \[6\]: In \[7\]: In \[8\]: In \[9\]: In \[10\]:`