On the Smallness of the Large Language Models Scaling Exponents
Summary
The paper discusses the small scaling exponents of large language models, arguing that they indicate an unsustainable regime in terms of energy resources. It also examines the 'pedestal effect' and draws analogies with fluid turbulence to comment on data smoothness.
# On the Smallness of the Large Language Models Scaling Exponents
Source: [https://arxiv.org/html/2606.24504](https://arxiv.org/html/2606.24504)
Sauro Succi (1,2), Peter V. Coveney (3) and Alex Hansen (2)
(1) Italian Institute of Technology, Viale Regina Elena, 291, 00161, Rome, Italy
(2) PoreLab, Physics Department, Norwegian University of Science and Technology, 7491 Trondheim, Norway
(3) Centre for Computational Science, Chemistry Department, University College of London, 20 Gordon Street, WC1H 0AJ, London, United Kingdom
###### Abstract
We discuss reasons why the scaling exponents of current Large Language Model (LLM) applications indicate an unsustainable regime in terms of energy resources. We further show that attributing the smallness of such exponents to a numerical bias, due to the neglect of a non-zero value of the loss function in the limit of infinite data (“pedestal effect”), does not remove the unsustainability issue. Finally, the effect of the smoothness (roughness) of the data on the scaling exponents is commented upon, based on an analogy with phenomenological models of fluid turbulence.
## 1 Introduction
AI, and most notably transformer-based Large Language Models (LLMs), have taken science and society by storm in the recent past\[[1](https://arxiv.org/html/2606.24504#bib.bib1),[2](https://arxiv.org/html/2606.24504#bib.bib2),[3](https://arxiv.org/html/2606.24504#bib.bib3)\]. Much of this explosive growth is driven by the “no-wall” finding, namely the fact that the learning capability of LLM-based chatbots keeps increasing with size, whence the mantra “bigger is better”, which has driven much of the current leading-edge LLM research.
Let us recapitulate the main ideas behind the “no\-wall” finding\.
The scaling law of LLMs is usually expressed in the form
$$L(N) = \frac{A}{N^{\alpha}}, \tag{1}$$
where $L$ is the loss function, a measure of the departure of the LLM outcome from the desired target, and $N$ stands for data ($D$), number of parameters ($P$) or computational cost ($C$), each characterized by its own constant prefactor $A$ and exponent $\alpha$\[[4](https://arxiv.org/html/2606.24504#bib.bib4),[5](https://arxiv.org/html/2606.24504#bib.bib5)\]. A positive value of the exponent indicates that the loss function decreases with increasing data size, so that using larger data sets leads to a closer match to the desired target. Formally, the so-called “wall” corresponds to the opposite regime, in which more data lead to a larger discrepancy between model and target, as reflected by a negative exponent. The LLM exponents have so far remained positive across several decades, whence the enthusiastic claim that “there is no wall,” which supports a strategy oriented towards ever larger (and more energy-consuming) LLM applications\[[6](https://arxiv.org/html/2606.24504#bib.bib6)\].
While the accomplishment is remarkable on all scientific counts, many researchers have observed that the LLM scaling exponents are very small, typically in the range $0.05$ to $0.10$, pointing to a regime of “diminishing returns” for pre-trained LLMs. In a recent paper, two of the present authors pointed out what “diminishing returns” means in actual practice: since $N \propto L^{-1/\alpha}$, an exponent of $0.1$ means that cutting the loss function down by a factor of 2 requires $2^{10}=1024$ times more resources\[[7](https://arxiv.org/html/2606.24504#bib.bib7)\]. It was then argued that, while formally wall-free, such a regime is simply unsustainable, whence the need for new directions paying more attention to physical insight and compliance with world models than to muscular leverage of the number of parameters\[[8](https://arxiv.org/html/2606.24504#bib.bib8),[9](https://arxiv.org/html/2606.24504#bib.bib9),[10](https://arxiv.org/html/2606.24504#bib.bib10),[11](https://arxiv.org/html/2606.24504#bib.bib11)\].
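As a rough illustration of the arithmetic above, the following sketch inverts Eq. (1) to obtain the factor $k^{1/\alpha}$ by which $N$ must grow in order to cut the loss by a factor $k$. The function name `resource_multiplier` and the sample exponents are illustrative choices, not part of the original analysis.

```python
def resource_multiplier(alpha: float, loss_reduction: float = 2.0) -> float:
    """Factor by which N must grow to divide L = A / N**alpha by `loss_reduction`.

    Follows from Eq. (1): L(N2) / L(N1) = (N1 / N2)**alpha.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive (the 'no-wall' regime)")
    return loss_reduction ** (1.0 / alpha)


# Sample exponents: the LLM range (0.05, 0.1), Monte Carlo (0.5), second-order grids (2.0)
for alpha in (0.05, 0.1, 0.5, 2.0):
    factor = resource_multiplier(alpha)
    print(f"alpha = {alpha:4.2f}: halving the loss needs {factor:.4g} times more resources")
```

For $\alpha=0.1$ this reproduces the factor $2^{10}=1024$ quoted in the text, while $\alpha=0.05$ squares it.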
Among the critical feedback spawned by this simple observation, a recurrent one is that the loss function cannot be paralleled to numerical discretization errors, the reason being that the ML procedures do not necessarily aim at sending the loss function to zero, but are typically stopped before that limit is approached, usually in order to forestall overfitting and ensuing problems in generalizing, i\.e\., the ability to reproduce unseen targets\[[3](https://arxiv.org/html/2606.24504#bib.bib3)\]\.
In the following, we argue that such criticism does not change the conclusion on the unsustainability of LLMs scaling exponents\.
## 2 The loss function as a pseudo-metric of accuracy
The approximation error associated with a numerical method employing $N$ degrees of freedom, say the grid discretization of a PDE, usually follows an asymptotic scaling relation of the form
$$E(N) = \frac{A}{N^{a}}, \tag{2}$$
where $a>0$ is the order of accuracy and $A$ is a prefactor measuring the “critical” size $N_c = A^{1/a}$ above which the error starts to show a power-law decay. Most grid methods work around $a=2$, while stochastic particle methods, such as Monte Carlo, feature $a \sim 1/2$. Note that $a=2$ means that reducing the error by a factor of $2$ takes just $\sqrt{2}$ times more resources ($N$), while with $a=1/2$ this factor is $4$, which is generally regarded as poor convergence.
The above relation encodes a basic requirement on any well-posed numerical scheme, called “consistency”, namely that upon sending $N\to\infty$ the error should vanish, so as to reproduce the original target, generally the analytical solution of a continuum differential equation. Any non-zero value $E_\infty \equiv E(N\to\infty)$ (the “pedestal”) or, worse, an error that increases with increasing resolution (the “wall”) is regarded as an anomaly, usually caused by some form of ill-posedness of the numerical discretization, typically the breaking of a basic continuum symmetry by the discrete scheme.
The above relation bears a direct analogy to the LLM scaling law\[[4](https://arxiv.org/html/2606.24504#bib.bib4)\], whence the simple conclusion that an exponent $\alpha=0.1$ is simply unsustainable. Claiming that this analogy is flawed because machine learning practices do not concern themselves with the limit $L\to 0$ as $N\to\infty$ is tantamount to saying that consistency, a prime requirement for any well-posed numerical method, is not relevant to machine learning as a scientific discipline.
While it is not hard to see reasons why one might be happy to settle for “small enough” values of the loss function (the so-called early-stopping empirical practice), it remains undeniable that dismissing consistency in favor of ambiguous “small enough” criteria leaves much to be desired in terms of the reliability of the methodology as a systematic scientific method. This is especially true with regard to the ability to generalize to unseen data, which is the essence of true learning as opposed to mere memorization of the training data. Yet, it is true that matching a given set of data is not equivalent to reproducing the solution of a continuum PDE, so let us accept the loss function as a sort of empirical pseudo-metric with no need to comply with the consistency requirement. In the following we shall argue that even this “lenient perspective” does not affect the claim of unsustainability made in\[[7](https://arxiv.org/html/2606.24504#bib.bib7)\]. The point is no longer consistency, but efficiency, namely how fast the error decreases with increasing resources, which is precisely dictated by the actual value of the scaling exponent.
## 3 The pedestal effect
Recently, a group of authors\[[12](https://arxiv.org/html/2606.24504#bib.bib12)\] have built on the so-called Chinchilla scaling\[[13](https://arxiv.org/html/2606.24504#bib.bib13)\] to argue that the LLM scaling exponents published by Anthropic and in subsequent works are “biased” because they ignore a “pedestal” in the scaling laws, namely the fact that the loss function does not vanish in the “continuum limit” $N\to\infty$.
To appreciate the point, let us begin by casting the loss function in the form
$$L(x) = L_0 + (L_1 - L_0)\, x^{\alpha}, \tag{3}$$
where we have set $x \equiv 1/N$ for the sake of convenience, so that $L_0 = L(x=0)$ is the continuum limit and $L_1 = L(x=1)$ is the large-scale limit $N=1$. Note that $\alpha>0$ denotes the “no-wall” regime in which $L$ grows with $x=1/N$.
Clearly, the pedestal $L_0$ introduces its own exponent, $\alpha=0$, so that the “effective” exponent associated with the above relation must necessarily lie between $0$ and $\alpha$, the precise form of the transition depending on the ratio $p = L_0/L_1$. To highlight the point, let us rescale $L \to L/L_1$ and write
$$L(x;p) = p\, x^{0} + (1-p)\, x^{\alpha}. \tag{4}$$
Next, let us define a “running” scaling exponent associated with a given pedestal $p$ as
$$\alpha_p(x) = \frac{x\, L'(x;p)}{L(x;p)}, \tag{5}$$
where the prime denotes the derivative with respect to $x$. Clearly, the running exponent $\alpha_p(x)$ returns a constant only in the case of a single-exponent power-law behavior, specifically $\alpha_{p=0}(x)=\alpha$ and $\alpha_{p=1}(x)=0$.
This function is shown in Fig. 1 for $p=0.01, 0.05, 0.1, 0.2$ in the case of Chinchilla scaling, $\alpha \equiv \alpha_C \sim 1/3$. The figure shows that the transition from the Chinchilla (C) regime to the Anthropic (A) regime, with $\alpha \sim 0.05$ to $0.1$, is rather sharp, indicating that it does not take extremely large datasets to pass from the “high” (C) to the “low” (A) scaling regime. This observation highlights the major relevance of the pedestal effect in assessing the scaling performance of LLMs.
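For the rescaled loss of Eq. (4), the definition in Eq. (5) can be worked out in closed form, $\alpha_p(x) = \alpha\,(1-p)\,x^{\alpha} / [\,p + (1-p)\,x^{\alpha}\,]$. The sketch below evaluates this expression and cross-checks it against a finite-difference derivative; the function names and the sampled sizes $N$ are illustrative assumptions, and the values are not divided by $(1-p)$ as in Fig. 1.

```python
def loss(x: float, p: float, alpha: float) -> float:
    """Rescaled loss of Eq. (4): L(x; p) = p + (1 - p) * x**alpha, with x = 1/N."""
    return p + (1.0 - p) * x ** alpha


def running_exponent(x: float, p: float, alpha: float) -> float:
    """Closed form of Eq. (5), x * L'(x) / L(x), for the loss defined above."""
    power_term = (1.0 - p) * x ** alpha
    return alpha * power_term / (p + power_term)


def running_exponent_fd(x: float, p: float, alpha: float, rel_step: float = 1e-6) -> float:
    """Central finite-difference estimate of x * L'(x) / L(x), used as a cross-check."""
    dx = x * rel_step
    slope = (loss(x + dx, p, alpha) - loss(x - dx, p, alpha)) / (2.0 * dx)
    return x * slope / loss(x, p, alpha)


ALPHA_C = 1.0 / 3.0  # Chinchilla-like exponent quoted in the text

for p in (0.01, 0.05, 0.1, 0.2):
    row = ", ".join(
        f"N=1e{k}: {running_exponent(10.0 ** -k, p, ALPHA_C):.3f}" for k in (0, 3, 6, 9)
    )
    print(f"p={p:<4}: {row}")
```

The printed rows show how quickly the running exponent drops below $\alpha_C$ as $N$ grows, for each pedestal value.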
Figure 1: The running scaling exponent as a function of $1/N$ for $\alpha_C=1/3$ (solid horizontal line) and $p=0.01, 0.05, 0.1, 0.2$ (right to left to top). The figure clearly shows that the transition from $1/3$ to the A-region $0.05<\alpha<0.1$ takes place for sizes well below those of the largest LLM applications, whence the dominance of the pedestal effect in controlling the scaling properties of such applications. The values are divided by $(1-p)$ to ensure the condition $\alpha(x=1,p)=\alpha_C$ for any value of $p$.

The same idea can be formulated by introducing a critical threshold $x_{crit}(p)$ below which the low-exponent A regime becomes the dominant one. Based on the expression ([4](https://arxiv.org/html/2606.24504#S3.E4)), this occurs under the condition
$$p \gg (1-p)\, x^{\alpha_C},$$
namely, whenever
$$x \ll x_{crit}(p) = \left(\frac{p}{1-p}\right)^{1/\alpha_C}. \tag{6}$$
Hence, for $\alpha_C=1/3$, and switching back to the size $N$, we obtain
$$N \gg N_{crit}(p) = \left(\frac{1-p}{p}\right)^{3}. \tag{7}$$
Fig. 2 shows the boundary $N_{crit}(p)$ between the C and A regimes as a function of $p$. From this figure, it is clear that it takes very small values of the pedestal for the C regime to be the dominant one for large datasets. Differently restated, by inverting the relation (7), we obtain
$$p \ll p_{crit}(N) = \frac{1}{1+N^{1/3}} \sim N^{-1/3}. \tag{8}$$
Hence, even for a moderate-size dataset with $N=10^{6}$, we already have $p_{crit} \sim 10^{-2}$. Since the loss function in most LLM applications decays by about half a decade over several decades in $N$, the pedestal value is well above $0.01$, showing that the Chinchilla exponent is not relevant to the scaling performance of large datasets such as the ones used in modern LLM applications.
Figure 2: The critical boundary $N_{crit}(p) = \left(\frac{1-p}{p}\right)^{3}$ between the C and A scaling regions, as a function of the pedestal $p$. The C-scaling is only relevant below the critical boundary, hence for relatively small dataset sizes, unless the pedestal is made unrealistically small.
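The two critical quantities of Eqs. (7) and (8) are easy to tabulate. The sketch below is a minimal illustration, written for a general exponent $\alpha_C$ (an assumption that reduces to the formulas in the text for $\alpha_C=1/3$); the names `n_crit` and `p_crit` are hypothetical helpers.

```python
def n_crit(p: float, alpha_c: float = 1.0 / 3.0) -> float:
    """Critical size of Eq. (7): N_crit(p) = ((1 - p) / p) ** (1 / alpha_c)."""
    if not 0.0 < p < 1.0:
        raise ValueError("the pedestal p must lie strictly between 0 and 1")
    return ((1.0 - p) / p) ** (1.0 / alpha_c)


def p_crit(n: float, alpha_c: float = 1.0 / 3.0) -> float:
    """Critical pedestal of Eq. (8): p_crit(N) = 1 / (1 + N ** alpha_c)."""
    return 1.0 / (1.0 + n ** alpha_c)


# Pedestal values used in Fig. 1
for p in (0.01, 0.05, 0.1, 0.2):
    print(f"p = {p:<4}: N_crit = {n_crit(p):.3g}")

# Dataset sizes spanning several decades
for exponent in (3, 6, 9, 12):
    print(f"N = 1e{exponent}: p_crit = {p_crit(10.0 ** exponent):.2e}")
```

The second loop recovers the estimate $p_{crit}\sim 10^{-2}$ for $N=10^{6}$ and shows that the threshold keeps shrinking as $N^{-1/3}$ for larger datasets.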
## 4 Theoretical prediction of the scaling exponents
Having pinpointed the importance of the pedestal in assessing the effective scaling properties of LLMs, the natural next step is to look for a theoretical explanation of the value of the scaling exponents. A simple yet quite convincing explanation was provided by Sharma and Kaplan (SK hereafter)\[[14](https://arxiv.org/html/2606.24504#bib.bib14)\], whose work lends further credit to the relevance of the A-exponents as a proper measure of LLM scaling performance.
These authors present a simple and convincing toy theory, supported by experimental data, according to which the LLM scaling exponents obey the following lower bound,
$$\alpha \sim \frac{4}{d}, \tag{9}$$
where $d$ is the intrinsic dimension of the data manifold, namely the number of independent coordinates required to describe the manifold where most data reside\[[15](https://arxiv.org/html/2606.24504#bib.bib15)\].
Note that for many complex applications $d \ll D$, $D$ being the dimension of the embedding space. The above relation is remarkable for its simplicity, elegance and robustness across a broad ensemble of numerical datasets.
A few comments are in order\.
First, note that the SK relation refers to concrete LLM practice, hence it is fully consistent with the pseudo-metric perspective discussed above.
Second, let us recall that in $D$ dimensions, a grid discretization method of order $a$ features an exponent
$$\alpha(D) = \frac{a}{D}. \tag{10}$$
This results from the simple observation that the volume $V(D)$ of a $D$-dimensional region of space of diameter $\delta$ scales like $V(D)\sim\delta^{D}$, and that standard (non-adaptive) grid discretization treats all portions of the $D$-dimensional space on the same footing, regardless of whether or not they host any interesting information process (typically they do not). This is the infamous Curse of Dimensionality (CoD)\[[16](https://arxiv.org/html/2606.24504#bib.bib16)\].
Third, the expression ([10](https://arxiv.org/html/2606.24504#S4.E10)) invites a natural analogy with the SK relation, with $D=d$ and $a=4$. This means that LLMs can be paralleled to adaptive grid refinement methods with $a=4$ in a $d$-dimensional feature space. By adaptive, we mean methods which place the numerical degrees of freedom “on demand”, i.e., where the relevant information is located and not everywhere in a $D$-dimensional region of space. This ability to spot the relevant low-dimensional manifold where the computational resources should be focused is essential to (partially) tame the CoD, possibly one of the major achievements of LLM research.
Let us expand on the above by revisiting the main ideas behind the SK formula and connect them with the phenomenology of fluid turbulence\.
## 5 The SK model and connections with turbulent fractals
Let us consider a $d$-dimensional cube of side $1$ and fill it with $N(s)$ cubelets of side $s$. By definition $N(s)\,s^{d} = 1^{d} = 1$, whence
$$N(s) = s^{-d}. \tag{11}$$
Next, let us define the loss function as the numerical error associated with a piecewise-constant representation $f_N(x)$ of a smooth function $f(x)$ within the unit-side cube,
$$L(s) = \int |f(x) - f_N(x)|^{2} \, d^{d}x. \tag{12}$$
Since the function $f(x)$ is smooth, within each cubelet there is always a position $x'$ such that $f_N(x) = f(x')$. Hence, by virtue of smoothness, the difference $|f(x)-f_N(x)|$ in the above integral is bounded by $g\,s\,d^{1/2}$, where $g$ is a constant proportional to the gradient of $f$ and $s\,d^{1/2}$ is the diameter of the cubelet. As a result, we obtain $L(s) \sim g^{2} s^{2} d$, namely, based on ([11](https://arxiv.org/html/2606.24504#S5.E11)),
$$L(N) \sim N^{-2/d}, \tag{13}$$
whence $\alpha = 2/d$. The same argument applied to a piecewise-linear representation of $f(x)$ returns $s^{4}$ instead of $s^{2}$, whence the sought $4/d$ exponent. Essentially, the scaling exponent is dictated by the smoothness (roughness) of the target function, the accuracy of the discretization and the dimensionality of the intrinsic manifold.
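The $N^{-2/d}$ law of Eq. (13) can be checked numerically in the simplest setting. The sketch below is restricted to $d=1$ and to a piecewise-constant representation that takes the value of $f$ at each cell midpoint; the test function $\sin(2\pi x)$, the cell counts and the quadrature resolution are arbitrary illustrative choices. The exponent is estimated as the least-squares slope of $\log L$ against $\log N$.

```python
import math


def l2_error_piecewise_constant(f, n_cells: int, samples_per_cell: int = 200) -> float:
    """Approximate L = integral over [0, 1] of |f(x) - f_N(x)|**2 (Eq. (12) with d = 1).

    f_N is constant on each cell of side s = 1 / n_cells and equal to f at the
    cell midpoint; the integral is evaluated with a midpoint quadrature rule.
    """
    s = 1.0 / n_cells
    sub = s / samples_per_cell
    total = 0.0
    for i in range(n_cells):
        f_cell = f((i + 0.5) * s)  # piecewise-constant value on cell i
        for j in range(samples_per_cell):
            x = i * s + (j + 0.5) * sub
            total += (f(x) - f_cell) ** 2 * sub
    return total


def fitted_exponent(f, cell_counts) -> float:
    """Least-squares slope of log L versus log N, sign-flipped so that alpha > 0."""
    xs = [math.log(n) for n in cell_counts]
    ys = [math.log(l2_error_piecewise_constant(f, n)) for n in cell_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return -num / den


alpha_fit = fitted_exponent(lambda x: math.sin(2.0 * math.pi * x), [8, 16, 32, 64, 128])
print(f"fitted exponent for d = 1: {alpha_fit:.3f} (Eq. (13) predicts 2/d = 2)")
```

Extending the check to $d>1$ would require a $d$-dimensional grid, whose cost grows as $s^{-d}$, which is itself an instance of the scaling in Eq. (11).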
SK rightly evoke a connection between adaptive refinement and fractal sets, and in the following we expand on this analogy in semi-quantitative terms, building on standard arguments from the fractal theory of fluid turbulence\[[17](https://arxiv.org/html/2606.24504#bib.bib17)\]. The purpose is by no means to claim a direct connection between LLMs and turbulence, but just to emphasize that many, if not most, complex phenomena do not give rise to smooth signals\[[19](https://arxiv.org/html/2606.24504#bib.bib19)\] and, according to ([13](https://arxiv.org/html/2606.24504#S5.E13)), this directly affects the value of the corresponding scaling exponents. More precisely, rough signals lead to smaller scaling exponents than smooth ones.
### 5.1 Scaling exponents of three-dimensional homogeneous incompressible turbulence
Let us consider a turbulent eddy (an active degree of freedom of the turbulent flow) of size $l$. More precisely, let $v(l)$ be the increment of the velocity across a turbulent eddy of size $l$, i.e., $v(l) = |v(x+l) - v(x)|$, where $x$ runs over the geometrical region occupied by the fluid (for a homogeneous fluid the dependence on $x$ drops out). Based on Kolmogorov’s 1941 theory (K41)\[[20](https://arxiv.org/html/2606.24504#bib.bib20)\], the energy supplied to the flow at large scales is transferred to small scales, virtually with zero dissipation, through the process of energy cascade. Large eddies break up into smaller daughter eddies, which further break up into even smaller granddaughter eddies, and so on down the line until eddies are sufficiently small for dissipation to take over.
According to K41, the energy flux is the same across scales, hence the energy dissipation rate reads
$$\epsilon(l) = \frac{v^{3}(l)}{l} = \text{const.}, \tag{14}$$
where we have taken the lifetime of the above eddy as $\tau(l) \sim l/v(l)$.
This implies $v(l)\sim l^{1/3}$, i.e., the turbulent flow is not smooth, but exhibits a scaling (Hurst) exponent $h=1/3$, as opposed to a smooth (differentiable) signal featuring $h=1$. This alone brings the scaling exponent down by a factor $h$, i.e., $\alpha \sim 4h/d$, with no need to invoke fractal structures, just the fact that the signal is not differentiable.
The connection to fractals arises from the observation that K41 theory captures many features of turbulence, but fails to describe intermittency, namely the occurrence, here and there, of abrupt bursts of activity. To account for this feature, one postulates that energy dissipation does not act as a space-filling process, but rather occurs on a set of fractal dimension $d<D$, where $D$ is the dimension of the embedding space, for fluids typically $D=2$ or $D=3$.
To compute the fractal dimension $d$, an argument quite similar to that of SK is usually adopted\[[21](https://arxiv.org/html/2606.24504#bib.bib21)\]. Namely, one introduces the probability $\beta(l)$ that a small sphere (or cubelet) of radius $l$ intercepts a $d$-dimensional object within a three-dimensional ($D=3$) cube of side $1$. Such probability is readily shown to scale as
$$\beta(l) \sim l^{D-d},$$
where $D-d$ is the so-called co-dimension. Next, one assumes that each time a turbulent eddy of size $l$ breaks down into a series of smaller eddies, only a fraction $\beta(l)$ remains active, namely amenable to further breakup. Hence, the kinetic energy available to sustain the energy cascade at scale $l$ is $E(l) = \beta(l)\,v^{2}(l)$.
Combining this with ([14](https://arxiv.org/html/2606.24504#S5.E14)), one obtains
$$v(l) \sim \epsilon_{0}\, l^{1/3} \beta(l)^{-1/3}, \tag{15}$$
yielding an effective exponent
$$h = \frac{1}{3} - \frac{D-d}{3} = \frac{d-2}{3}, \tag{16}$$
where we have used $D=3$. This shows that the K41 exponent $h=1/3$ corresponds to a space-filling dissipation process with $d=D=3$, whereas intermittency occurs on a fractal set of dimension $d<D=3$, so that $h<1/3$.
Differently restated, the turbulent energy is a space-filling process and in the K41 picture so is energy dissipation: dissipation “does not hide,” and the intrinsic manifold fills up the entire embedding space. Even so, the velocity signal is quite rough, with a scaling exponent $h=1/3$.
In the case of intermittency, dissipation is no longer space filling, and the velocity signal gets further roughness in proportion to the co\-dimension of the fractal set where dissipation takes place, according to the expression \([16](https://arxiv.org/html/2606.24504#S5.E16)\)\.
In homogeneous incompressible fluid turbulence one finds $d\sim 2.8$ (the so-called beta model discussed above), corresponding to $h \sim 0.27$, but other models deliver different values, all above $2$, since $d<2$ would imply $h<0$, denoting a singularity in the velocity field, something that cannot happen in a homogeneous incompressible flow.
In passing, it is worth mentioning that turbulence is in fact multifractal, meaning by this that dissipation inhabits a whole set of different manifolds, each with its own scaling exponent\[[17](https://arxiv.org/html/2606.24504#bib.bib17),[18](https://arxiv.org/html/2606.24504#bib.bib18)\]\.
The implications for LLMs are as follows\.
For non-smooth processes, such as turbulence and many other complex dynamical systems with $h<1$, the scaling exponents become even smaller, namely
$$\alpha \sim \frac{4h}{d}. \tag{17}$$
Differently restated, the intrinsic dimensions of non-smooth processes are magnified by a factor $1/h$, which spells even more trouble for LLM performance on non-smooth data.
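To make the combined effect of roughness and dimensionality concrete, the sketch below evaluates Eqs. (16) and (17) for a few sample values. The function names are hypothetical, and the two meanings of $d$ used in the text are kept apart on purpose: `fractal_dim` is the dimension of the dissipative set in the turbulence model, while `intrinsic_dim` is the dimension of the data manifold.

```python
def beta_model_hurst(fractal_dim: float, embedding_dim: float = 3.0) -> float:
    """Eq. (16): h = 1/3 - (D - d)/3 for a dissipative set of fractal dimension d."""
    return (1.0 - (embedding_dim - fractal_dim)) / 3.0


def scaling_exponent(intrinsic_dim: float, hurst: float = 1.0, order: float = 4.0) -> float:
    """Eq. (17): alpha ~ order * h / d; hurst = 1 recovers the smooth case of Eq. (9)."""
    if intrinsic_dim <= 0.0:
        raise ValueError("the intrinsic dimension must be positive")
    return order * hurst / intrinsic_dim


print(f"beta model, d = 2.8: h = {beta_model_hurst(2.8):.3f}")

# Sample intrinsic dimensions and roughness values (illustrative, not fitted to any dataset)
for intrinsic_dim in (10, 40, 100):
    for hurst in (1.0, 1.0 / 3.0):
        alpha = scaling_exponent(intrinsic_dim, hurst)
        effective_dim = intrinsic_dim / hurst  # dimension "magnified" by the factor 1/h
        print(f"d = {intrinsic_dim:3d}, h = {hurst:.2f}: "
              f"alpha = {alpha:.3f}, effective dimension = {effective_dim:.0f}")
```

With $h=1$ the threshold $\alpha=0.1$ is reached at $d=40$, whereas for a K41-like signal with $h=1/3$ the same intrinsic dimension already gives an exponent three times smaller.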
On the other hand, LLMs appear to be amazingly efficient in compressing the relevant information onto low-dimensional manifolds. While turbulence only “needs” to go from $D=3$ to $d>2$ in order to pin down the dissipative manifold, LLMs accomplish far larger dimensional compression tasks: the embedding dimension of the feature space in modern ChatGPT models runs into the several thousands, as opposed to an intrinsic dimension $d\sim 100$, corresponding to a quite impressive dimensional compression $d/D \sim O(10^{-2})$.
The ability of LLMs to pin down and represent information under such extreme dimensional compression regimes is quite remarkable\.
## 6 Conclusion
If the SK analysis is anything to go by, and we see no reason to doubt it, we must conclude that any dataset with $d>40$ is bound to feature $\alpha<0.1$, and is hence doomed to unsustainability, regardless of whether or not the loss function should be paralleled to a numerical discretization error. As shown in this note, the situation only gets worse in the case of non-smooth data, as is often the case for complex systems. This is the actual state of affairs: notwithstanding the absolutely amazing ability of LLMs to chase the information hidden in extremely tiny corners of ultra-dimensional feature spaces, the dimension of the intrinsic manifolds is still too high to be sustainable. This is why going down the “bigger is better” avenue under the drive of the “no-wall” excitement is a sure recipe for energy burnout. Quite possibly, a return to foundational physics-aware AI models, the so-called “world models”\[[11](https://arxiv.org/html/2606.24504#bib.bib11)\], offers a far more promising strategy.
This work was partly supported by the Research Council of Norway through its Centers of Excellence funding scheme, project number 262644\. AH furthermore acknowledges funding from the European Research Council \(Grant Agreement 101141323 AGIPORE\)\.
## References
- \[1\]Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia \(December 2017\)\. ”Attention is All you Need”\. In I\. Guyon and U\. Von Luxburg and S\. Bengio and H\. Wallach and R\. Fergus and S\. Vishwanathan and R\. Garnett \(ed\.\), 31st Conference on Neural Information Processing Systems \(NIPS\)\. Advances in Neural Information Processing Systems\. Vol\. 30\. Curran Associates, Inc\. arXiv:1706\.03762 \(2023\)\.
- \[2\]J Jumper, R Evans, A Pritzel, T Green, M Figurnov, O Ronneberger, K Tunyasuvunakool, R Bates, A Žídek, A Potapenko, A Bridgland, C Meyer, S A\. A\. Kohl, AJ\. Ballard, A Cowie, B Romera\-Paredes, S Nikolov, R Jain, J Adler, T Back, S Petersen, D Reiman, E Clancy, M Zielinski, …Demis Hassabis, Highly accurate protein structure prediction with AlphaFold, Nature, 596,583–589 \(2021\)
- \[3\]Y\. LeCun, Y\. Bengio and G\. Hinton, Deep Learning, Nature 521 \(7553\), 436\-444 \(2015\)
- \[4\]J\. Kaplan et al, Scaling laws for neural language models, arXiv:2001\.08361v1 \[cs\.LG\] 23 Jan 2020
- \[5\]Y Bahri, E Dyer, J Kaplan, J Lee, U Sharma, Explaining neural scaling laws, Proceedings of the National Academy of Sciences 121 \(27\), e2311878121, \(2024\)
- \[6\]Z\. Ji, M\. Jiang, A systematic review of electricity demand for large language models: evaluations, challenges, and solutions, Renewable and Sustainable Energy Reviews, Volume 225, 116159 \(2026\)
- \[7\]P\.V\. Coveney and S\. Succi, The wall confronting large language models, arXiv preprint arXiv:2507\.19703 \(2025\)
- \[8\]PV Coveney, ER Dougherty, RR Highfield, Big data need big theory too, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences; 374:20160153 \(2016\)
- \[9\]S\.Succi, PV Coveney, Big data: the end of the scientific method? Philosophical Transactions of the Royal Society A; 377:201801 \(2019\)
- \[10\]A\. Hansen, S\. Succi, A new kind of science, Frontiers in physics, 13, 1760758 \(2025\)
- \[11\]K Vafa, PG Chang, A Rambachan, S Mullainathan, What has a foundation model found? using inductive bias to probe for world models, arXiv preprint arXiv:2507\.06952 \(2025\)
- \[12\]Y Liu, Z Liu, J Gore, Superposition yields robust neural scaling, Advances in Neural Information Processing Systems, 38, 159269\-159305, 20 \(2026\)
- \[13\]J\. Hoffmann et al, Training Compute\-Optimal Large language Models, arXiv:2203\.15556v1 \[cs\.CL\] 29 Mar 2022
- \[14\]U\. Sharma and J\. Kaplan, Scaling Laws from the Data Manifold Dimension, Journal of Machine Learning Research 23, 1\-34 \(2024\)
- \[15\]A Ansuini, A Laio, JH Macke, D Zoccolan, Intrinsic dimension of data representations in deep neural networks, Advances in Neural Information Processing Systems 32 \(2019\)
- \[16\]R\. Bellman, Dynamic Programming, Princeton University Press, Princeton, NJ, \(1957\)
- \[17\]U\. Frisch, Turbulence, the legacy of A\.N\. Kolmogorov, Cambridge U\.P\., 1995
- \[18\]M\. Briscolini, P\. Santangelo, S\. Succi and R\. Benzi, Extended self\-similarity in the numerical simulation of three\-dimensional homogeneous flows, Phys\. Rev\. E, 50, 3, R1745 \(1994\)
- \[19\]S\. Succi, Sailing the ocean of complexity: lessons from the physics\-biology interface, Oxford: Oxford University Press \(2022\)
- \[20\]A\. N\. Kolmogorov, The local structure of turbulence in incompressible viscous fluid for very large Reynolds numbers, Dokl\. Akad\. Nauk SSSR, 30, 9\-13, reprinted in Proc\. Roy\. Soc\. London 434, 9\-13 \(1991\)
- \[21\]U\. Frisch, P\. Sulem, M\. Nelkin, J\. Fluid Mech\., vol\. 87, Aug\. 29, 719\-736 \(1978\)\.