A Theory of Training Profit-Optimal LLMs

arXiv cs.LG 05/19/26, 04:00 AM Papers
llm-training scaling-laws economic-model profit-optimization machine-learning ai-economics
Summary
This paper develops an economic model combining scaling laws with microeconomic theory to analyze profit-optimal training of large language models, considering trade-offs between model quality, training costs, and hardware efficiency.
arXiv:2605.16430v1 Announce Type: new Abstract: Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
# A Theory of Training Profit-Optimal LLMs
Source: [https://arxiv.org/html/2605.16430](https://arxiv.org/html/2605.16430)
Sophie Hao∗ (Boston University, Boston, MA, USA; uu@bu.edu) and William Merrill (Allen Institute for AI, Seattle, WA, USA; willm@allenai.org)

###### Abstract

Scaling large language models (LLMs) requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^{2}/E$, i.e., increases with data and *decreases* with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.

## 1 Introduction

At the time of writing, tremendous capital expenditure has gone towards training LLMs. At a high level, this bet is motivated by the empirical phenomenon of *scaling laws*: making LLMs larger and training them on more data monotonically improves their quality, measured in terms of training loss (Kaplan et al., [2020](https://arxiv.org/html/2605.16430#bib.bib9); Hoffmann et al., [2022](https://arxiv.org/html/2605.16430#bib.bib8)). Higher LLM quality is associated with better performance on downstream tasks (Wei et al., [2022](https://arxiv.org/html/2605.16430#bib.bib16)). This suggests scaling up LLM training could lead to models that are useful for many potential users, and thus profitable for their trainers.

On the other hand, improving the quality of LLMs by scaling them up makes them more expensive, both in terms of raw computation (“compute”, measured in FLOPs) and dollars. The compute required to train an LLM is proportional to both the parameter count $n$ and the training data budget $d$, so scaling both by a common factor increases training compute quadratically in that factor. Inference compute scales with $n$ but not $d$. As $n$ and $d$ have jumped orders of magnitude, the compute required for training and inference has also increased exponentially. While quality improvements are monotonic in these variables, they are also *diminishing*, leading to the question of whether doubling down on making training more expensive to chase vanishing quality improvements will make LLM training profitable in the long run.

Figure 1 (four panels, plots not reproduced: LLM size (params) vs. hardware efficiency (FLOPs/\$); training expenditure (\$) vs. hardware efficiency (FLOPs/\$); LLM size (params) vs. parameter efficiency (LLM quality/param); training expenditure (\$) vs. parameter efficiency (LLM quality/param)): Our model predicts that a profit-maximizing LLM firm scales its LLM training expenditures subquadratically with hardware efficiency and inversely with parameter efficiency, assuming the Chinchilla parameter exponents are $\alpha \approx \beta \approx .3$. Plots are shown for $\gamma = -1$ (to be defined in [Subsection 2.1](https://arxiv.org/html/2605.16430#S2.SS1)).

In this paper, we address this open question by developing a theory of *profit-optimal* training behavior for LLM firms. Under our model, the firm chooses its model size $n$ and training data budget $d$. Larger choices of $n$ and $d$ produce higher-quality LLMs, which increases demand for the LLM, allowing the firm to charge more per inference token. On the other hand, larger choices of $n$ and $d$ incur additional training and inference costs, suggesting that there is some *profit-optimal* $n^{*}, d^{*}$, i.e., a choice of these variables that maximizes profit. We characterize $n^{*}, d^{*}$ as a function of exogenous variables like hardware efficiency (FLOPs/\$), the parameter and data efficiency of LLM training methods, and various natural constants.

Our first contribution is to formalize the profit maximization problem for an LLM firm with a monopoly on the market ([Section 2](https://arxiv.org/html/2605.16430#S2)). LLM quality as well as training and inference costs can be measured relatively uncontroversially; the major challenge here is defining precisely how LLM quality improvements affect demand and token price. We formalize this in a general way where each consumer has a minimum quality threshold that makes the LLM useful to them (e.g., it sufficiently solves all tasks relevant to their domain). Then, the relationship between quality and demand reduces to a question about the distribution of this quality threshold across consumers. We make the general assumption that it follows a power law $\propto 1 - q^{-\gamma}$, where the exponent $\gamma$ controls the degree to which inverse demand is diminishing in quality. Following standard ideas in economics (Acemoglu, [2025](https://arxiv.org/html/2605.16430#bib.bib1)), we assume inverse demand is diminishing or at most linear in quality, i.e., $\gamma \geq -1$. This weak assumption suffices to prove our main results.

Having formalized the profit maximization problem, [Section 4](https://arxiv.org/html/2605.16430#S4) characterizes profit-optimal LLM training in the *compute-bound* regime, i.e., the setting where the LLM firm is limited by training and inference costs but not by the amount of data available. We find that optimal model size $n^{*}$ and data budget $d^{*}$ increase with hardware efficiency $E$ at a near-linear rate of $E^{1/(1+\alpha\gamma)}$, where $\alpha$ is the Chinchilla parameter exponent (Hoffmann et al., [2022](https://arxiv.org/html/2605.16430#bib.bib8)). Thus, overall training expenditure $C^{*}_{\operatorname{train}}$ scales with $E$ at a rate of $E^{\frac{1-\alpha\gamma}{1+\alpha\gamma}}$. We also consider the role of LLM training methodology: depending on the sign of $\gamma$, improvements in parameter efficiency can either increase or decrease $C^{*}_{\operatorname{train}}$. In contrast, data efficiency improvements reduce $n^{*}$ but keep $C^{*}_{\operatorname{train}}$ fixed.

Additionally, in [Section 5](https://arxiv.org/html/2605.16430#S5), we characterize profit-optimal scaling in the data-bound regime where a fixed maximum data budget $D$ is prescribed. Here, $n^{*}$ scales with $D$ at a near-linear rate, independent of $E$. Notably, training expenditure $C^{*}_{\operatorname{train}}$ increases roughly quadratically with $D$, but *decreases* with $E$. Both $n^{*}$ and $C^{*}_{\operatorname{train}}$ increase with data efficiency advances but decrease with parameter efficiency advances in the data-bound regime.

Finally, we consider in [Section 6](https://arxiv.org/html/2605.16430#S6) how the characterization of profit-optimal LLM training under our model matches the empirical growth trends in training expenditure and related variables. With $\gamma = 0$, which we take as a weak prior, we find that current training expenditure exceeds what is profit-optimal in the compute-bound regime. Solving for the value of $\gamma$ that would make current trends profit-optimal, we find $\hat{\gamma} \approx -0.77$, meaning that inverse demand is barely diminishing in LLM quality. Thus, there is some version of our model where current trends are consistent with profit-optimal training behavior in the compute-bound regime.

Overall, we extend the compute-optimal training framework (Hoffmann et al., [2022](https://arxiv.org/html/2605.16430#bib.bib8)) to model profit-optimal LLM training; we also comprehensively characterize profit-optimal training in the compute-bound and data-bound regimes. We hope our results can provide a rigorous framework for forecasting future developments in LLM training and critically engaging with industry trends; to this end, we include a comprehensive discussion of our results’ implications, underlying assumptions, and reconciliation with other narratives on the profitability of LLM scaling ([Section 7](https://arxiv.org/html/2605.16430#S7)).

## 2 Setup: LLM Firms and Profit Maximization

In microeconomics, *firms* are entities that take *input factors* and produce an *output good* that is sold to consumers. A pizzeria, for example, is a firm whose input factors are pizza ingredients, rent, and labor, and whose output good is pizza. The *theory of the firm* aims to describe the behavior of firms in terms of the quantity of inputs they consume and outputs they produce, assuming that each firm maximizes profit within a market that may or may not be competitive. (Footnote 1: See Varian and Melitz ([2024](https://arxiv.org/html/2605.16430#bib.bib15)) for an overview of the relevant microeconomic theory for this paper.)

In this section, we develop a microeconomic model that describes the behavior of a firm that trains an LLM and uses it to run an AI chatbot service. The firm’s inputs consist of *training data* and *compute*, and its output consists of *tokens* that are sold to consumers. Our goal is to make predictions about the relation between expectations of increased compute efficiency and the firm’s investment into scaling.

Our model has the following characteristics. Because the focus of this paper is on LLM scaling and not on the effects of competition, we assume that the LLM firm operates with monopoly power. The LLM firm maximizes profit by deciding how many tokens to produce and sell, subject to technological constraints and consumer demand. Additionally, the LLM firm decides how much data and compute it will invest into training the LLM. A greater investment of input factors endows the LLM firm with a higher *quality* LLM, which in turn increases demand for the LLM’s tokens.

### 2.1 Consumer Behavior

Figure 2 (plot not reproduced; axes: tokens vs. price in \$/token; price intercepts labeled $\omega f(q)$ and $\omega f(q+\Delta q)$): Inverse demand functions for tokens generated by an LLM of quality $q$ (black) and $q+\Delta q$ (red), where $\Delta q > 0$. For any particular level of LLM quality $q$, demand for tokens is linear, with $\omega f(q)$ being the highest possible price that a token could be sold for (e.g., $\omega \ln(q)$ under the logarithmic linking function). Training a better model increases the demand for tokens from that model.

LLMs are general-purpose, open-ended AI models, which can be applied to a potentially unlimited range of *tasks*. LLMs of higher *quality*, as measured by inverse next-token prediction loss, have been shown to achieve better performance (Kaplan et al., [2020](https://arxiv.org/html/2605.16430#bib.bib9); Srivastava et al., [2023](https://arxiv.org/html/2605.16430#bib.bib14)) on a wider range of tasks (Brown et al., [2020](https://arxiv.org/html/2605.16430#bib.bib6); Wei et al., [2022](https://arxiv.org/html/2605.16430#bib.bib16)). Accordingly, AI chatbot services are typically priced by the inference token, with higher prices charged for tokens generated by a higher-quality LLM.

In the theory of the firm, consumer behavior is described by an *inverse demand function* that predicts the unit price $p$ at which a good is sold from the quantity $t$ of the good that is sold. Following the *law of demand*, we assume that $p$ decreases with respect to $t$. Additionally, since higher-quality LLMs can be applied to a wider range of tasks, we assume that $p$ increases with respect to some measure of LLM quality $q$, which for now is left generic. We capture both of these dependencies by proposing the following *quasilinear* inverse demand function:

$$p(t,q) = \omega f(q) - \delta t,$$

where $f(q)$ is some *linking function* from quality to inverse demand that must be strictly monotonic and differentiable with respect to $q$. Much of our analysis will apply to any choice of $f$ satisfying these properties, but three natural options are $f(q) = \ln(q)$, $f(q) = f_{\gamma}(q)$ with $\gamma > 0$, and $f(q) = f_{\gamma}(q)$ with $\gamma < 0$, where

$$f_{\gamma}(q) = \frac{1}{\gamma}\left(1 - q^{-\gamma}\right).$$

All of these satisfy the axioms above and also yield diminishing returns to increasing model quality as long as $\gamma > -1$. Setting $\gamma = -1$ makes inverse demand linear in quality. Furthermore, we will see that they all can be motivated as special cases of the same general framework.
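To make the role of $\gamma$ concrete, the following minimal Python sketch evaluates the linking function. The function name `linking` and the tolerance used to switch to the logarithmic case are illustrative choices made here, not part of the paper.

```python
import math


def linking(q: float, gamma: float) -> float:
    """Evaluate f_gamma(q) = (1 - q**(-gamma)) / gamma, with f_0(q) = ln(q)."""
    if q <= 0:
        raise ValueError("quality q must be positive")
    if abs(gamma) < 1e-12:  # treat gamma = 0 as the logarithmic case
        return math.log(q)
    return (1.0 - q ** (-gamma)) / gamma


if __name__ == "__main__":
    # Gain in f from doubling quality, starting at q = 1, 2, 4.
    for gamma in (-1.0, 0.0, 1.0):
        gains = [linking(2.0 * q, gamma) - linking(q, gamma) for q in (1.0, 2.0, 4.0)]
        print(gamma, [round(g, 3) for g in gains])
```

With `gamma = -1` the gain from each doubling of `q` itself doubles (inverse demand is linear in `q`), with `gamma = 0` it stays constant, and with `gamma = 1` it halves each time.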

Figure 3 (plots not reproduced; three panels showing $f_{-1}(q)$, $f_{0}(q)$, and $f_{1}(q)$ against $q$): Inverse demand linking functions are parameterized by $\gamma$, which controls the degree to which demand for an LLM’s tokens experiences diminishing returns to quality.

#### Derivation of Linking Functions.

We derive the general form of $f_{\gamma}(q)$ from the following assumptions, somewhat similar to the *quantization model* framework of Michaud et al. ([2023](https://arxiv.org/html/2605.16430#bib.bib12)):

1. Each potential consumer of the LLM has a *reservation quality* $q_{*} > 0$ such that they will pay for the LLM if and only if its quality $q \geq q_{*}$.
2. This implies a *density* over reservation qualities giving the rate at which tasks are unlocked as quality increases. This density follows a power law $1/q_{*}^{1+\gamma}$, for some $-1 < \gamma$.
3. The maximum price that can be charged for an LLM token is proportional to the number of consumers willing to buy that token, i.e., the number of consumers for whom $q \geq q_{*}$.

Writing $p(t,q) = \omega f_{\gamma}(q) - \delta t$, we have

$$f_{\gamma}(q) = \max_{t \geq 0} \frac{p(t,q)}{\omega} = \int_{0}^{q} \frac{1}{q_{*}^{1+\gamma}} \, \mathrm{d}q_{*}.$$

We show in [Appendix A](https://arxiv.org/html/2605.16430#A1) that this formula recovers $f_{\gamma}(q)$ in the way it was defined above.
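As a worked check of the closed form, the power-law density can be integrated directly. The lower limit $q_0$ is kept general below; normalizing it to $q_0 = 1$ is an assumption of this sketch (the paper's own treatment is in its Appendix A, which is not reproduced in this section).

```latex
\begin{align*}
\int_{q_0}^{q} q_*^{-(1+\gamma)} \, \mathrm{d}q_*
  &= \left[ -\frac{q_*^{-\gamma}}{\gamma} \right]_{q_0}^{q}
   = \frac{1}{\gamma}\left( q_0^{-\gamma} - q^{-\gamma} \right)
   && (\gamma \neq 0), \\
\frac{1}{\gamma}\left( q_0^{-\gamma} - q^{-\gamma} \right)\Big|_{q_0 = 1}
  &= \frac{1}{\gamma}\left( 1 - q^{-\gamma} \right) = f_\gamma(q), \\
\lim_{\gamma \to 0} \frac{1 - q^{-\gamma}}{\gamma}
  &= \lim_{\gamma \to 0} \frac{1 - e^{-\gamma \ln q}}{\gamma} = \ln q .
\end{align*}
```

Under that normalization the integral matches $f_{\gamma}(q)$ term by term, and the logarithmic linking function appears as the $\gamma \to 0$ limit.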

### 2.2 LLM Scaling

Figure 4 (plots not reproduced; three panels of iso-quality curves in the $(n, d)$ plane: Leontief, $\sigma = 0$; Chinchilla, $\sigma = .76$; Perfect Substitutes, $\sigma = \infty$): A scaling law’s elasticity of substitution $\sigma$ measures the curvature of its *iso-quality curves*, where $q(n,d)$ is constant. When $\sigma < 1$, we say that $n$ and $d$ are *complements*, and quality is optimized when $n$ and $d$ are scaled together. When $\sigma \geq 1$, $n$ and $d$ are *substitutes*, meaning they can be exchanged for one another without sacrificing quality.

Research on LLM scaling has shown that an LLM’s loss $\ell$ on next-token prediction is determined by the LLM’s *model size* $n$, in number of trainable parameters, and *training data size* $d$, in tokens (Kaplan et al., [2020](https://arxiv.org/html/2605.16430#bib.bib9); Hoffmann et al., [2022](https://arxiv.org/html/2605.16430#bib.bib8)), where higher-quality models generally have a lower loss. Hoffmann et al. ([2022](https://arxiv.org/html/2605.16430#bib.bib8)) in particular show that model size and training data size are *complements*: scaling both $n$ and $d$ results in the greatest possible increase in LLM quality. To capture this relation, we propose the *Leontief scaling law* for model quality as a function of $n$ and $d$:

$$q(n,d) = \min\{a n^{\alpha}, b d^{\beta}\}$$

where $a > 0$, $b > 0$, $0 < \alpha \leq 1$, and $0 < \beta \leq 1$.

The Leontief scaling law is based on descriptions of production technologies where input factors are *perfect complements*; i.e., no increase in production is obtained unless all input factors are scaled simultaneously (Leontief, [1941](https://arxiv.org/html/2605.16430#bib.bib10)). The degree to which input factors must be scaled together is measured by a scaling law’s *elasticity of substitution*, $\sigma$. As shown in [Figure 4](https://arxiv.org/html/2605.16430#S2.F4), $\sigma$ measures the degree to which $n$ and $d$ can be substituted with one another without sacrificing quality. $n$ and $d$ are *substitutes* when $\sigma \geq 1$ and complements when $\sigma < 1$. In [Appendix B](https://arxiv.org/html/2605.16430#A2) we show that the *Chinchilla scaling law* of [Hoffmann et al.](https://arxiv.org/html/2605.16430#bib.bib8) ([2022](https://arxiv.org/html/2605.16430#bib.bib8)) has an elasticity of substitution of $\sigma \approx .76$, which makes $n$ and $d$ complements. The Leontief scaling law simplifies our analysis by idealizing this relationship.
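The perfect-complements behavior can be illustrated with a short Python sketch of the Leontief scaling law. The default coefficients and exponents below are arbitrary placeholders chosen for illustration, not estimates from the paper.

```python
def leontief_quality(n, d, a=1.0, b=1.0, alpha=0.3, beta=0.3):
    """Leontief scaling law q(n, d) = min(a * n**alpha, b * d**beta)."""
    return min(a * n ** alpha, b * d ** beta)


if __name__ == "__main__":
    # Data-limited model: only the b * d**beta arm is binding.
    print(leontief_quality(1e9, 1e6))
    # Adding parameters alone leaves quality unchanged.
    print(leontief_quality(1e12, 1e6))
    # Scaling the data up to match the model raises quality.
    print(leontief_quality(1e9, 1e9))
```

Because quality is the minimum of the two arms, extra parameters are wasted whenever the data arm is the binding one, and vice versa.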

### 2.3 Profit Maximization Problem

The LLM firm faces a *profit maximization problem* given by:

$$n^{*}, d^{*}, t^{*} = \operatorname*{argmax}_{n,d,t} \pi(n,d,t) \quad \text{subject to} \quad n \geq 0,\ d \geq 0,\ t \geq 0$$

where $\pi(n,d,t)$ is the profit earned by selling $t$ tokens generated by an LLM of size $n$ trained on $d$ tokens of data. Following microeconomic theory, we assume that the firm chooses values of $n$, $d$, and $t$ that maximize profit. The solution to the profit maximization problem therefore gives a complete description of the LLM firm’s behavior.

The LLM firm’s profit $\pi$ is given by its *revenue* $R$ minus its *cost* $C$. The LLM firm’s revenue is the amount of money it earns by selling $t$ tokens generated by an LLM of quality $q$ given by the Leontief scaling law, at the price of $p$ dollars per token given by the inverse demand function:

$$R(n,d,t) = p(t, q(n,d))\, t = \omega t \cdot f(q(n,d)) - \delta t^{2}$$

The LLM firm’s cost is that of purchasing $c_{\operatorname{train}} + c_{\inf}$ FLOPs of training and inference compute, respectively, at a price of $p_{c}$ dollars per FLOP. We assume without loss of generality that training data are free. The price of compute is determined by $p_{c} = 1/E$, where $E$ is the equilibrium *hardware efficiency* supplied by the market. We follow Kaplan et al. ([2020](https://arxiv.org/html/2605.16430#bib.bib9)) in assuming that $c_{\operatorname{train}} = 6nd$ and $c_{\inf} = 2nt$, which implies that the costs of training and inference in dollars are $C_{\operatorname{train}} = 6nd/E$ and $C_{\inf} = 2nt/E$, respectively. This means the overall cost incurred by the LLM firm is

$$C(n,d,t) = C_{\operatorname{train}} + C_{\inf} = \frac{6nd + 2nt}{E}.$$

Finally, defining profit in terms of revenue and cost, we have

$$\pi(n,d,t) = R(n,d,t) - C(n,d,t) = \omega t \cdot f(q(n,d)) - \delta t^{2} - \frac{6nd + 2nt}{E}.$$
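The pieces above can be assembled into a small Python sketch of the profit function. The parameter defaults (`omega`, `delta`, the scaling coefficients) are arbitrary illustrative values, and using `math.log` as the default linking function is an assumption of this example rather than a choice the paper fixes at this point.

```python
import math


def leontief_quality(n, d, a=1.0, b=1.0, alpha=0.3, beta=0.3):
    """Leontief scaling law q(n, d) = min(a * n**alpha, b * d**beta)."""
    return min(a * n ** alpha, b * d ** beta)


def profit(n, d, t, E, omega=1.0, delta=0.1, f=math.log, **scaling):
    """pi(n, d, t) = omega*t*f(q) - delta*t**2 - (6*n*d + 2*n*t) / E."""
    q = leontief_quality(n, d, **scaling)
    price = omega * f(q) - delta * t          # inverse demand p(t, q)
    revenue = price * t                       # R = p * t
    cost = (6.0 * n * d + 2.0 * n * t) / E    # training + inference, in dollars
    return revenue - cost


if __name__ == "__main__":
    print(profit(n=100.0, d=100.0, t=2.0, E=1e3))
```

Selling no tokens (`t = 0`) leaves only the training bill, so the profit is then `-6*n*d/E`.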

## 3 Behavior of Profit-Maximizing Firms

We analyze the LLM firm’s behavior by studying the local maxima of the profit function $\pi$. In this section, we solve for $t^{*}$ and $d^{*}$ in terms of $n$, reducing $\pi$ to a function of one variable. The analysis of this section is agnostic to our choice of inverse demand linking function $f$.

To solve for $t$, we observe that $\pi$ is quadratic and concave-down in $t$. Thus, for any given value of $n$ and $d$, $t^{*}$ is given by the following *first-order condition*:

$$0 = \frac{\partial \pi(n,d,t^{*})}{\partial t} = \omega \cdot f(q(n,d)) - 2\delta t^{*} - \frac{2n}{E}. \tag{$t$}$$

Solving ([t](https://arxiv.org/html/2605.16430#S3.Ex9)) for $t^{*}$ gives us the following.

###### Lemma 1\.

Consider any $f$ and $q$ and fix $n$ and $d$. Then, $\pi(n,d,t)$ has a local maximum at $t = t^{*}(n,d)$, where

$$t^{*}(n,d) = \frac{1}{2\delta}\left(\omega \cdot f(q(n,d)) - \frac{2n}{E}\right).$$

By substituting $t = t^{*}(n,d)$, we express the profit function in terms of $n$ and $d$ only:

$$\pi(n,d) = \pi(n,d,t^{*}(n,d)) = \frac{1}{4\delta}\left(\omega \cdot f(q(n,d)) - \frac{2n}{E}\right)^{2} - \frac{6nd}{E}.$$
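The closed form for $t^{*}$ and the reduced profit can be sanity-checked numerically. The sketch below passes the value $f(q(n,d))$ in directly as `fq`; all numbers are arbitrary placeholders chosen so the arithmetic is easy to follow.

```python
def profit_at(fq, n, d, t, E, omega, delta):
    """pi(n, d, t) with f(q(n, d)) supplied as the number fq."""
    return omega * t * fq - delta * t ** 2 - (6.0 * n * d + 2.0 * n * t) / E


def optimal_tokens(fq, n, E, omega, delta):
    """Lemma 1: t* = (omega * fq - 2n/E) / (2 * delta)."""
    return (omega * fq - 2.0 * n / E) / (2.0 * delta)


def reduced_profit(fq, n, d, E, omega, delta):
    """pi(n, d) after substituting t = t*(n, d)."""
    return (omega * fq - 2.0 * n / E) ** 2 / (4.0 * delta) - 6.0 * n * d / E


if __name__ == "__main__":
    args = dict(fq=2.0, n=10.0, E=100.0, omega=1.0, delta=0.5)
    t_opt = optimal_tokens(**args)
    print(t_opt)
    print(profit_at(d=20.0, t=t_opt, **args), reduced_profit(d=20.0, **args))
    # Moving t away from t* in either direction lowers profit.
    print(profit_at(d=20.0, t=t_opt - 0.1, **args), profit_at(d=20.0, t=t_opt + 0.1, **args))
```

Evaluating the three-variable profit at `t_opt` reproduces the reduced form, and perturbing `t` in either direction gives a strictly smaller value, as expected for a concave quadratic.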
Next, we show that under the Leontief scaling law, we can eliminate the variable $d$ in $\pi$ without loss of generality by assuming that $an^{\alpha} = bd^{\beta}$. This is because $n$ and $d$ are perfect complements under Leontief scaling: when $an^{\alpha} = bd^{\beta}$, $q(n,d)$ does not improve with further training. We show here that the same holds of the LLM firm’s profit, which means that any profit-optimal $n^{*}, d^{*}$ must also be *quality-optimal*, in a sense analogous to Chinchilla compute optimality (Hoffmann et al., [2022](https://arxiv.org/html/2605.16430#bib.bib8)).

###### Lemma 2\.

Let $q(n,d) = \min\{an^{\alpha}, bd^{\beta}\}$ and let $n, d \geq 0$. Then there exist $n' \in [0,n]$ and $d' \in [0,d]$ such that $a(n')^{\alpha} = b(d')^{\beta}$ and:

1. Quality is preserved, i.e., $q(n',d') = q(n,d)$.
2. For all $t > 0$ we have $\pi(n',d',t) \geq \pi(n,d,t)$, with equality if and only if $(n',d') = (n,d)$.

Thus, for $t > 0$, any choice of $n, d$ that maximizes profit $\pi$ must satisfy $an^{\alpha} = bd^{\beta}$.

###### Proof\.

By contradiction, assume we have some $n, d$ that maximizes profit such that $an^{\alpha} > bd^{\beta}$. (The $an^{\alpha} < bd^{\beta}$ case is analogous.) Choose $n' = \left(b/a\right)^{1/\alpha} d^{\beta/\alpha}$; observe that $n', d$ achieves the same quality as $n, d$ and thus the same revenue, due to the monotonicity of $f$. At the same time, it attains strictly lower cost because $n' < n$ and cost is strictly increasing in $n$, assuming $t > 0$. Thus, $\pi(n',d,t) > \pi(n,d,t)$ for any $t > 0$, contradicting the assumption that $n, d$ maximizes profit. ∎
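The construction in the proof can be traced with concrete numbers. In this sketch the helper names and all parameter values are hypothetical; it starts from an over-parameterized model ($an^{\alpha} > bd^{\beta}$), trims $n$ to the balanced $n'$, and compares quality and cost.

```python
def balanced_model_size(d, a, b, alpha, beta):
    """n' from the proof of Lemma 2, so that a * n'**alpha == b * d**beta."""
    return (b / a) ** (1.0 / alpha) * d ** (beta / alpha)


def quality(n, d, a, b, alpha, beta):
    """Leontief scaling law."""
    return min(a * n ** alpha, b * d ** beta)


def total_cost(n, d, t, E):
    """Training plus inference cost in dollars."""
    return (6.0 * n * d + 2.0 * n * t) / E


if __name__ == "__main__":
    law = dict(a=2.0, b=1.0, alpha=0.3, beta=0.3)
    n, d, t, E = 1e6, 1e3, 10.0, 1e9
    n_trim = balanced_model_size(d, **law)
    print(n_trim)
    print(quality(n, d, **law), quality(n_trim, d, **law))   # same quality
    print(total_cost(n, d, t, E), total_cost(n_trim, d, t, E))  # lower cost
```

Since revenue depends on $n$ and $d$ only through quality, equal quality at lower cost means strictly higher profit for the trimmed model.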

Solving $an^{\alpha} = bd^{\beta}$ for $d$, we obtain

$$d = \left(\frac{a}{b}\right)^{1/\beta} n^{\alpha/\beta} = \rho n^{\alpha/\beta},$$

where we define $\rho \triangleq (a/b)^{1/\beta}$, which equals $(a/b)^{1/\alpha}$ under the assumption $\alpha = \beta$ adopted below, to be the Leontief scaling law’s *parameter-to-token efficiency factor*. $\rho$ represents how much more efficient parameters are compared to additional training tokens for improving LLM quality. Setting $d = \rho n^{\alpha/\beta}$, the single-variable version of the profit function is:

$$\pi(n) = \pi(n, \rho n^{\alpha/\beta}) = \frac{1}{4\delta}\left(\omega f(an^{\alpha}) - \frac{2n}{E}\right)^{2} - \frac{6\rho n^{1+\alpha/\beta}}{E}.$$

Finally, empirical research on LLM scaling has found that $\alpha \approx \beta$ (Hoffmann et al., [2022](https://arxiv.org/html/2605.16430#bib.bib8); Besiroglu et al., [2024](https://arxiv.org/html/2605.16430#bib.bib4); Merrill et al., [2026](https://arxiv.org/html/2605.16430#bib.bib11)). (Footnote 2: Hoffmann et al. ([2022](https://arxiv.org/html/2605.16430#bib.bib8)) estimate $\alpha = .34$ and $\beta = .28$; Besiroglu et al. ([2024](https://arxiv.org/html/2605.16430#bib.bib4)) estimate $\alpha = .35$ and $\beta = .37$; Merrill et al. ([2026](https://arxiv.org/html/2605.16430#bib.bib11)) estimate $\alpha = .25$ and $\beta = .21$, $\alpha = .23$ and $\beta = .22$, and $\alpha = .18$ and $\beta = .23$ for three families of LLMs.) Setting $\alpha = \beta$ simplifies $\pi(n)$ to

$$\pi(n) = \pi(n, \rho n) = \frac{1}{4\delta}\left(\omega f(an^{\alpha}) - \frac{2n}{E}\right)^{2} - \frac{6\rho n^{2}}{E}.$$
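Although the paper proceeds analytically, the single-variable profit can also be explored numerically. The following sketch maximizes $\pi(n)$ by a log-spaced grid search, a brute-force technique swapped in here for illustration only. It assumes $f = \ln$, $\alpha = \beta = 0.3$, and unit values for $a$, $\rho$, $\omega$, and $\delta$ (all placeholders), and it clamps $t^{*}$ at zero so that the constraint $t \geq 0$ is respected when the first-order condition would give a negative token count.

```python
import math


def reduced_profit(n, E, a=1.0, alpha=0.3, rho=1.0, omega=1.0, delta=1.0):
    """pi(n) for alpha = beta with f = ln, enforcing t* >= 0."""
    margin = omega * math.log(a * n ** alpha) - 2.0 * n / E
    t_star = max(margin, 0.0) / (2.0 * delta)
    # Equals margin**2 / (4 * delta) when margin >= 0, and 0 otherwise.
    operating = t_star * margin - delta * t_star ** 2
    return operating - 6.0 * rho * n ** 2 / E


def best_model_size(E, lo=1.0, hi=1e7, steps=4000, **params):
    """Grid search for the n in [lo, hi] that maximizes reduced_profit."""
    best_n, best_pi = lo, reduced_profit(lo, E, **params)
    for i in range(1, steps + 1):
        n = lo * (hi / lo) ** (i / steps)
        pi = reduced_profit(n, E, **params)
        if pi > best_pi:
            best_n, best_pi = n, pi
    return best_n


if __name__ == "__main__":
    for E in (1e6, 1e8):
        print(E, best_model_size(E))
```

Under these placeholder values the maximizer is interior and grows when `E` is raised, which is the qualitative behavior the comparative statics in the next section characterize through asymptotic upper bounds.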

## 4 Profit-Optimal Scaling Based on Hardware and Algorithmic Efficiency

We now investigate the *comparative statics* of our model: how the LLM firm’s behavior responds to changes in parameters that are exogenous to the firm’s profit maximization problem. In particular, we study how the size of the LLM trained by the firm, $n^{*}$, as well as the LLM firm’s total investment in training, $C^{*}_{\operatorname{train}}$, scale with $E$, $a$, and $\rho$.

In general, the LLM firm’s profit maximization problem does not admit a tractable closed-form solution. In order to analyze the LLM firm’s comparative statics, we derive asymptotic bounds on $n^{*}$ and $C^{*}_{\operatorname{train}}$ in terms of $E$, $a$, and $\rho$, treating separately the cases where the inverse demand linking function is given by $f_{\gamma}(q)$ with $\gamma \neq 0$ ([Subsection 4.1](https://arxiv.org/html/2605.16430#S4.SS1)) and $f_{0}(q) = \ln(q)$ ([Subsection 4.2](https://arxiv.org/html/2605.16430#S4.SS2)).

### 4.1 Polynomial-Quasilinear Demand ($\gamma \neq 0$)

We obtain the following characterization of the LLM firm’s behavior when the inverse demand linking function is $f_{\gamma}(q) = \frac{1}{\gamma}\left(1 - q^{-\gamma}\right)$ for $\gamma \neq 0$. We defer a proof to [Appendix D](https://arxiv.org/html/2605.16430#A4).

###### Theorem 1\.

Let $\alpha = \beta$. Suppose the inverse demand linking function is given by $f_{\gamma}$ for $\gamma \neq 0$. Then, assuming $\alpha\gamma \leq 1$ and $E > 1/(6\delta\rho)$, the solution to the profit maximization problem satisfies

$$n^{*} = O\left(\left(\frac{E}{\rho a^{\gamma}}\right)^{1/(1+\alpha\gamma)}\right) \qquad d^{*} = O\left(\left(\frac{\rho^{\alpha\gamma} E}{a^{\gamma}}\right)^{1/(1+\alpha\gamma)}\right).$$

As a result, the LLM firm’s optimal training expenditure $C^{*}_{\operatorname{train}}$ is bounded as

$$C^{*}_{\operatorname{train}} = O\left(\left(\frac{E}{a^{\gamma}}\right)^{1/(1+\alpha\gamma)}\right).$$

[Theorem 1](https://arxiv.org/html/2605.16430#Thmtheorem1) says that LLM size $n^{*}$, training data budget $d^{*}$, and training expenditure $C^{*}_{\operatorname{train}}$ all increase with $E$: the rate is superlinear for $\gamma < 0$, linear at $\gamma = 0$, and sublinear for $\gamma > 0$. With $\alpha \approx 0.3$ and $\gamma = -1$, we obtain an upper bound of $n^{*}, d^{*} \lesssim E^{1.43}$ and $C^{*}_{\operatorname{train}} \lesssim E^{1.86}$. Assuming $\gamma > -1/\alpha$, which follows from assuming diminishing $f_{\gamma}$ since $-1/\alpha \approx -3.33$, increasing parameter efficiency *decreases* $n^{*}$. The upper bound on $C^{*}_{\operatorname{train}}$ also decreases as parameter efficiency increases. In contrast, increasing data efficiency always increases $n^{*}$ and also increases $d^{*}$ if $\gamma > 0$.
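The quoted exponents can be reproduced with a few lines of Python. This sketch treats the bounds on $n^{*}$ and $d^{*}$ as if they held with equality and substitutes them into $C_{\operatorname{train}} = 6nd/E$, which gives an $E$-exponent of $2/(1+\alpha\gamma) - 1 = (1-\alpha\gamma)/(1+\alpha\gamma)$. Both the tightness assumption and the function names are choices of this example.

```python
def size_exponent(alpha: float, gamma: float) -> float:
    """Exponent of E in the bounds on n* and d*: 1 / (1 + alpha*gamma)."""
    return 1.0 / (1.0 + alpha * gamma)


def spend_exponent(alpha: float, gamma: float) -> float:
    """Exponent of E in 6*n*d/E when n and d each scale as E**size_exponent."""
    return 2.0 * size_exponent(alpha, gamma) - 1.0


if __name__ == "__main__":
    for gamma in (-1.0, 0.0, 1.0):
        print(gamma, round(size_exponent(0.3, gamma), 2), round(spend_exponent(0.3, gamma), 2))
```

With `alpha = 0.3`, the case `gamma = -1` gives exponents of about 1.43 and 1.86, the case `gamma = 0` gives exactly 1 for both, and positive `gamma` gives values below 1.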

### 4.2 Log-Quasilinear Demand ($\gamma = 0$)

We obtain the following characterization of the LLM firm’s behavior when the inverse demand linking function is $f_{0}(q) = \ln q$. We defer a proof to [Appendix C](https://arxiv.org/html/2605.16430#A3).

###### Theorem 2\.

Let $\alpha = \beta$. Suppose the inverse demand linking function is given by $f_{0}$. Then, assuming $E > 1/(6\delta\rho)$, the solution to the profit maximization problem satisfies

$$n^{*} = O\left(\frac{aE}{\rho}\right) \qquad d^{*} = O\left(aE\right).$$

As a result, the optimal training expenditure $C^{*}_{\operatorname{train}}$ is bounded as

$$C^{*}_{\operatorname{train}} = O\left(\frac{a^{2}E}{\rho}\right).$$

[Theorem 2](https://arxiv.org/html/2605.16430#Thmtheorem2) says that, with $\gamma = 0$, $n^{*}$, $d^{*}$, and $C^{*}_{\operatorname{train}}$ all scale linearly with hardware efficiency $E$. Increased parameter efficiency $a$ increases optimal data budget $d^{*}$. On the other hand, increasing $a$ decreases $n^{*}$ and $C^{*}_{\operatorname{train}}$ when $\alpha < 1/2$, which is satisfied in practice. In contrast, data efficiency improvements incentivize larger models and overall training expenditure.
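The dependence on $a$ and $b$ becomes explicit once $\rho$ is expanded. The short derivation below uses $\rho = (a/b)^{1/\alpha}$ with $\alpha = \beta$ and reads the bounds of Theorem 2 as scaling relations, which is a simplifying assumption of this sketch.

```latex
\begin{align*}
\frac{aE}{\rho}
  &= a \left(\frac{b}{a}\right)^{1/\alpha} E
   = a^{\,1 - 1/\alpha}\, b^{\,1/\alpha}\, E, \\
\frac{a^{2}E}{\rho}
  &= a^{2} \left(\frac{b}{a}\right)^{1/\alpha} E
   = a^{\,2 - 1/\alpha}\, b^{\,1/\alpha}\, E.
\end{align*}
```

The exponent $1 - 1/\alpha$ is negative for any $\alpha < 1$, and $2 - 1/\alpha$ is negative exactly when $\alpha < 1/2$, which is where the condition in the text comes from. Both expressions increase with $b$, matching the statement that data efficiency improvements favor larger models and higher training expenditure.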

## 5Profit\-Optimal Scaling in the Data\-Bound Regime

We now turn to data\-bound regime where the number of pretraining tokens is limited within0≤d≤D0\\leq d\\leq D\. In this regime, we can express our solution in terms of not justA,EA,E, but alsoDD\. Using the fact thata\(n∗\)α=b\(d∗\)βa\(n^\{\*\}\)^\{\\alpha\}=b\(d^\{\*\}\)^\{\\beta\}\([Lemma 2](https://arxiv.org/html/2605.16430#Thmlemma2)\), we immediately obtain the following result:

###### Theorem 3\.

Enforce that $0\leq d\leq D$. If $d^{*}$ from the compute-bound analysis ([Section 4](https://arxiv.org/html/2605.16430#S4)) satisfies $d^{*}\leq D$, then that choice of $n^{*},d^{*}$ remains the profit maximum. If not, the profit maximum is given by

$$n^{*}=\rho^{-1}D^{\beta/\alpha},\quad\quad d^{*}=D.$$

In other words, optimal model size depends on both algorithmic efficiency and data, but not on hardware efficiency $E$. For training expenditure, we get

$$C^{*}_{\operatorname{train}}=\frac{6n^{*}d^{*}}{E}=\Theta\left(\frac{D^{1+\beta/\alpha}}{\rho E}\right).$$

Thus, optimal training expenditure grows roughly quadratically with the amount of data available. It also grows with data efficiency but *shrinks* with parameter efficiency and hardware efficiency.
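The data-bound corner solution is simple enough to evaluate directly. The following Python sketch assumes $\rho=(a/b)^{1/\alpha}$, as implied by $a(n^{*})^{\alpha}=b(d^{*})^{\beta}$ with $d^{*}=D$. The function name and all numeric inputs are hypothetical and chosen only for illustration.

```python
def data_bound_optimum(D, E, a, b, alpha, beta):
    """Data-bound corner solution: returns (n_star, d_star, c_train).

    Assumes rho = (a / b) ** (1 / alpha), inferred from a * n**alpha == b * d**beta.
    E is hardware efficiency in FLOPs per dollar, so c_train is in dollars.
    """
    rho = (a / b) ** (1.0 / alpha)
    n_star = D ** (beta / alpha) / rho
    d_star = D
    c_train = 6.0 * n_star * d_star / E
    return n_star, d_star, c_train


# Hypothetical inputs: 1e12 tokens available, 1e17 FLOPs per dollar.
base = data_bound_optimum(D=1e12, E=1e17, a=1.0, b=1.0, alpha=0.3, beta=0.3)
more_data = data_bound_optimum(D=2e12, E=1e17, a=1.0, b=1.0, alpha=0.3, beta=0.3)
better_hw = data_bound_optimum(D=1e12, E=2e17, a=1.0, b=1.0, alpha=0.3, beta=0.3)

print("cost ratio when D doubles:", more_data[2] / base[2])
print("cost ratio when E doubles:", better_hw[2] / base[2])
print("model size unchanged by E:", better_hw[0] == base[0])
```

With $\alpha=\beta$, doubling $D$ multiplies the expenditure by $2^{1+\beta/\alpha}=4$ and doubling $E$ halves it, up to floating-point rounding, while the model size does not move with $E$.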

## 6 Empirics: Real-World Training Expenditure Trends

Our theoretical model makes predictions for the way that advances in hardware and compute efficiency should shift profit-optimal expenditure on training compute. We now compare these to trends in practice using estimates for the annualized growth rates of these variables from the Epoch.ai dashboard (taken from [https://epoch.ai/trends](https://epoch.ai/trends) on May 3, 2026). They report training compute $\hat{C}_{\operatorname{train}}\propto 5^{t}$, hardware efficiency $E\propto 1.37^{t}$, and compute (“algorithmic”) efficiency $ab\propto 3^{t}$, where $t$ represents time in years. We assume that compute efficiency improvements affect parameter and data efficiency equally, i.e., $a\propto b\propto\sqrt{3^{t}}$.

We can now compare the observed growth rate of $\hat{C}_{\operatorname{train}}$ to our upper bound on optimal training compute according to these efficiency measurements. Parameterizing $C^{*}_{\operatorname{train}}$ in terms of these annualized growth rates, we have, for $\gamma\neq 0$,

$$\begin{aligned}
C^{*}_{\operatorname{train}}(\gamma)&\lesssim\left(1.37^{t}\right)^{\frac{1-\alpha\gamma}{1+\alpha\gamma}}\cdot\left(\sqrt{3^{t}}\right)^{-\frac{2\gamma}{1+\alpha\gamma}}\\
&=\left(1.37^{\frac{1-\alpha\gamma}{1+\alpha\gamma}}\cdot 3^{-\frac{\gamma}{1+\alpha\gamma}}\right)^{t}.
\end{aligned}$$

We will assume $\alpha\approx\beta\approx 0.3$ and consider the trends under different choices of $\gamma$. Profit-optimal training expenditure is largest under our model when $\gamma=-1$, where [Theorem 1](https://arxiv.org/html/2605.16430#Thmtheorem1) gives the bound

$$C^{*}_{\operatorname{train}}(-1)\lesssim\left(1.37^{\frac{1+\alpha}{1-\alpha}}\cdot 3^{\frac{1}{1-\alpha}}\right)^{t}\approx 7.59^{t}.$$

Thus, the observed empirical growth rate is within the most permissive bounds of our model. However, as $\gamma$ increases, the exponents for $E$ and $a$ decrease. With $\gamma=0$, [Theorem 2](https://arxiv.org/html/2605.16430#Thmtheorem2) gives the bound

$$C^{*}_{\operatorname{train}}(0)\lesssim a^{2}E\propto\left(3\cdot 1.37\right)^{t}\approx 4.11^{t}.$$

Thus, assuming $\gamma=0$, $\hat{C}_{\operatorname{train}}$ is growing too fast in practice compared to what our model predicts is profit-optimal. Solving for the break-even point $\hat{\gamma}$ via

$$1.37^{\frac{1-\alpha\hat{\gamma}}{1+\alpha\hat{\gamma}}}\cdot 3^{-\frac{\hat{\gamma}}{1+\alpha\hat{\gamma}}}=5,$$

we see that it is $\hat{\gamma}\approx-0.77$. Thus, while the current rate of growth in training compute exceeds profit-optimal scaling under most choices of $\gamma$ under our model, there are some choices of $\gamma\approx-1$ where inverse demand is only slightly diminishing in quality that are consistent with current trends.
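The break-even equation has no closed-form solution in $\hat{\gamma}$, but it can be solved numerically. The sketch below uses plain bisection on $[-1,-0.01]$ with $\alpha=0.3$ and the growth factors quoted above. The function names and the choice of bisection are illustrative assumptions, not a description of how the value in the text was obtained.

```python
ALPHA = 0.3          # alpha ~ beta ~ 0.3, as assumed in the text
G_HARDWARE = 1.37    # annual growth factor of hardware efficiency E
G_ALGO = 3.0         # annual growth factor of algorithmic efficiency a*b
G_OBSERVED = 5.0     # annual growth factor of observed training compute


def bound_growth(gamma: float, alpha: float = ALPHA) -> float:
    """Annual growth factor of the upper bound on C*_train, for gamma != 0."""
    denom = 1.0 + alpha * gamma
    return G_HARDWARE ** ((1.0 - alpha * gamma) / denom) * G_ALGO ** (-gamma / denom)


def solve_break_even(lo: float = -1.0, hi: float = -0.01, tol: float = 1e-10) -> float:
    """Bisection for bound_growth(gamma) == G_OBSERVED on the bracket [lo, hi]."""
    f_lo = bound_growth(lo) - G_OBSERVED
    f_hi = bound_growth(hi) - G_OBSERVED
    if f_lo * f_hi > 0:
        raise ValueError("no sign change on the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = bound_growth(mid) - G_OBSERVED
        if f_lo * f_mid <= 0:
            hi = mid              # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # root lies in [mid, hi]
    return 0.5 * (lo + hi)


gamma_hat = solve_break_even()
print(f"break-even gamma: {gamma_hat:.2f}")
```

Because the bound's growth factor is strictly decreasing in $\gamma$ on this interval, the bracket contains a single root, and the result should round to the $-0.77$ reported above.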

## 7 Discussion: Is Current Training Expenditure Profit-Optimal?

Before considering this question, we first summarize the qualitative findings under our model. In the compute-bound regime, optimal model size, data budget, and training investment grow as hardware efficiency improves: the rates are at best subquadratic for $\gamma=-1$ but shrink if inverse demand is more diminishing (as $\gamma$ increases). While advances in data efficiency incentivize larger models and training expenditure, the role of parameter efficiency is less clear.

In the data\-bound regime, more data being available incentivizes larger models at a near\-linear rate \(and thus compute increases near\-quadratically\)\. In this regime, advances in data efficiency incentivize larger models and more training compute, whereas advances in parameter efficiency incentivize smaller models and less training compute\.

We compare our model’s predictions in the compute-bound case to the empirical trends in training compute vs. hardware and compute efficiency. Under the choice of $\gamma=0$ (which we take as a reasonable default, though with significant uncertainty), we found that the current growth rate in training compute *exceeds* what would be profit-optimal. Solving for the value of $\gamma$ that would make the current rate of compute growth profit-optimal, we found $\hat{\gamma}\approx-0.77$, which is within the range of what we considered possible in our model, though towards the lower end. Further, in the data-bound case, our model predicts hardware efficiency improvements actually *reduce* profit-optimal training expenditure while leaving profit-optimal model size unchanged (Theorem 3), so, if modern training runs are data-bound, the current growth rate in compute expenditure would unequivocally exceed the profit-optimal upper bound under our model.

### 7.1 Rectification with Existing Narratives

Significant ink has been spilled in popular discourse on the scaling of LLMs (e.g., *the Scaling Hypothesis*; Branwen, [2022](https://arxiv.org/html/2605.16430#bib.bib5)) and their potential profitability. We therefore compare the predictions that our model makes for profit-optimal scaling against selected informal claims made on these themes. First, we consider the following excerpt from [Altman](https://arxiv.org/html/2605.16430#bib.bib3)’s ([2025](https://arxiv.org/html/2605.16430#bib.bib3)) *Three Observations* blog post:

> The cost to use a given level of AI falls about 10$\times$ every 12 months, and lower prices lead to much more use … [and] the socioeconomic value of linearly increasing intelligence is super-exponential in nature. A consequence of this is that we see no reason for exponentially increasing investment to stop in the near future.

As the CEO of OpenAI, [Altman](https://arxiv.org/html/2605.16430#bib.bib3) has a clear agenda here. The claim that “socioeconomic value” is superexponential in “intelligence” does not seem well-defined, though perhaps it could be mapped onto inverse demand and quality in our framework. Nevertheless, his claim that it could be profitable for training expenditure to grow exponentially with time is consistent with our model, assuming hardware efficiency, data efficiency, or parameter efficiency (for certain values of $\gamma$) also continue to grow exponentially, and the amount of training data available does not become a constraint.

This raises the question: what happens in our model assuming hardware advances plateau? In a recent blog post, Dettmers ([2025](https://arxiv.org/html/2605.16430#bib.bib7)) argues it is likely that the rate of growth in $E$ stalls due to physical limits, and this would limit the degree to which LLMs could be scaled up. Putting aside the empirical question of whether this is true, we can analyze the implications of stalled hardware advances under our model by taking $E=O(1)$ with respect to time $t$. Revisiting the analysis from [Section 6](https://arxiv.org/html/2605.16430#S6),

$$\begin{aligned}
C^{*}_{\operatorname{train}}(-1)&\lesssim\left(3^{\frac{1}{1-\alpha}}\right)^{t}\approx 4.66^{t}\\
C^{*}_{\operatorname{train}}(0)&\lesssim 3^{t}.
\end{aligned}$$

That is, an exponential growth rate in $C^{*}_{\operatorname{train}}$ is still possible in certain regimes of $\gamma$ assuming modeling advances in parameter- and data-efficiency continue. However, the current empirical growth rate in training expenditure $\hat{C}_{\operatorname{train}}\propto 5^{t}$ exceeds these upper bounds, even in the most permissive case of $\gamma=-1$. Thus, assuming hardware advances will stagnate, our model predictions agree with [Dettmers](https://arxiv.org/html/2605.16430#bib.bib7) that current rates of training expenditure would exceed what is profit-optimal.

### 7.2 Key Assumptions of Our Model

A key assumption in our analysis is that the LLM firm has a monopoly on the market\. In the case where there are competing LLM firms, their profit\-optimal behavior could change\. Extending our analysis to account for competition would be an interesting direction for future work\.

Another simplification we have made is using Leontief scaling for quality rather than Chinchilla scaling, which is justified by the fact that $n,d$ are approximately complements under Chinchilla, i.e., both must increase together to avoid diminishing returns, as visualized in [Figure 4](https://arxiv.org/html/2605.16430#S2.F4). It could still be interesting to extend the analysis to parameterize quality as the inverse of reducible loss under Chinchilla scaling, though overall we view Leontief as a quite reasonable approximation.

Finally, as discussed in [Subsection 2.1](https://arxiv.org/html/2605.16430#S2.SS1), we assume the inverse demand is diminishing in model quality, i.e., $\gamma>-1$. This is consistent with the law of diminishing returns in economics and similar to assumptions made in other analyses of the economic impacts of AI (Acemoglu, [2025](https://arxiv.org/html/2605.16430#bib.bib1)); essentially, it follows if we view AI as “normal technology” (Narayanan and Kapoor, [2025](https://arxiv.org/html/2605.16430#bib.bib13)). However, if one is convinced that quality increases in LLMs could be exceptionally transformative relative to quality improvements in other technologies, one might instead make the unconventional choice to model inverse demand as *superlinear* in LLM quality, i.e., set $\gamma<-1$. Similarly, our model does not account for the possibility of recursive self-improvement (Altair and Sotala, [2025](https://arxiv.org/html/2605.16430#bib.bib2)), the idea that higher-quality models might accelerate the rate of improvement in hardware, parameter, and data efficiency.

## References

- Acemoglu \(2025\)Daron Acemoglu\. 2025\.[The simple macroeconomics of AI](https://doi.org/10.1093/epolic/eiae042)\.*Economic Policy*, 40\(121\):13–58\.
- Altair and Sotala \(2025\)Alex Altair and Kaj Sotala\. 2025\.[Recursive Self\-Improvement](https://www.alignmentforum.org/w/recursive-self-improvement)\.Webpage,AI Alignment Forum\. Accessed: 2026\-05\-06\.
- Altman \(2025\)Sam Altman\. 2025\.[Three Observations](https://blog.samaltman.com/three-observations)\.Blog post,Sam Altman\. Accessed: 2026\-03\-26\.
- Besiroglu et al\. \(2024\)Tamay Besiroglu, Ege Erdil, Matthew Barnett, and Josh You\. 2024\.[Chinchilla Scaling: A replication attempt](https://doi.org/10.48550/arXiv.2404.10102)\.Computing Research Repository, arXiv:2404\.10102\.
- Branwen \(2022\)Gwern Branwen\. 2022\.[The Scaling Hypothesis](https://gwern.net/scaling-hypothesis)\.Blog post,Gwern\.net\. Accessed: 2026\-03\-20\.
- Brown et al\. \(2020\)Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert\-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei\. 2020\.Language Models are Few\-Shot Learners\.In*Advances in Neural Information Processing Systems*, volume 33, pages 1877–1901\. Curran Associates, Inc\.
- Dettmers \(2025\)Tim Dettmers\. 2025\.[Why AGI will not happen](https://timdettmers.com/2025/12/10/why-agi-will-not-happen)\.Blog post,Tim Dettmers\. Accessed: 2026\-05\-06\.
- Hoffmann et al\. \(2022\)Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Thomas Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karén Simonyan, Erich Elsen, Oriol Vinyals, Jack Rae, and Laurent Sifre\. 2022\.Training Compute\-Optimal Large Language Models\.*Advances in Neural Information Processing Systems*, 35:30016–30030\.
- Kaplan et al\. \(2020\)Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B\. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei\. 2020\.[Scaling Laws for Neural Language Models](https://doi.org/10.48550/arXiv.2001.08361)\.Computing Research Repository, arXiv:2001\.08361\.
- Leontief \(1941\)Wassily Leontief\. 1941\.*The Structure of American Economy, 1919–1929: An Empirical Application of Equilibrium Analysis*\.Harvard University Press, Cambridge, MA, USA\.
- Merrill et al\. \(2026\)William Merrill, Yanhong Li, Tyler Romero, Anej Svete, Caia Costello, Pradeep Dasigi, Dirk Groeneveld, David Heineman, Bailey Kuehl, Nathan Lambert, Chuan Li, Kyle Lo, Saumya Malik, D\. J\. Matusz, Benjamin Minixhofer, Jacob Morrison, Luca Soldaini, Finbarr Timbers, Pete Walsh, Noah A\. Smith, Hannaneh Hajishirzi, and Ashish Sabharwal\. 2026\.[Olmo Hybrid: From Theory to Practice and Back](https://doi.org/10.48550/arXiv.2604.03444)\.Computing Research Repository, arXiv:2604\.03444\.
- Michaud et al\. \(2023\)Eric Michaud, Ziming Liu, Uzay Girit, and Max Tegmark\. 2023\.The Quantization Model of Neural Scaling\.In*Advances in Neural Information Processing Systems*, volume 36, pages 28699–28722\. Curran Associates, Inc\.
- Narayanan and Kapoor \(2025\)Arvind Narayanan and Sayash Kapoor\. 2025\.[AI as normal technology: An alternative to the vision of AI as a potential superintelligence](https://kfai-documents.s3.amazonaws.com/documents/c3cac5a2a7/AI-as-Normal-Technology---Narayanan---Kapoor.pdf)\.Technical report, Knight First Amendment Institute, Columbia University\.
- Srivastava et al\. \(2023\)Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R\. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga\-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W\. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S\. Iyer, Anders Johan Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew M\. Dai, Andrew La, Andrew Kyle Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş, B\. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, Cesar Ferri, Chandan Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison\-Burch, Christopher Waites, Christian Voigt, Christopher D Manning, Christopher Potts, Cindy Ramirez, Clara E\. Rivera, Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, C\. 
Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra, Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong\-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodolà, Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A Chi, Ethan Dyer, Ethan Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez\-Plumed, Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germàn Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Xinyue Wang, Gonzalo Jaimovitch\-Lopez, Gregor Betz, Guy Gur\-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar, Henry Francis Anthony Shevlin, Hinrich Schuetze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B Simon, James Koppel, James Zheng, James Zou, Jan Kocon, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl\-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U\. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose Hernandez\-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B\. Tenenbaum, Joshua S\. 
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh Dhole, Kevin Gimpel, Kevin Omondi, Kory Wallace Mathewson, Kristen Chiafullo, Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia Contreras\-Ochando, Louis\-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble, Ludwig Schmidt, Luheng He, Luis Oliveros\-Colón, Luke Metz, Lütfi Kerem Senel, Maarten Bosma, Maarten Sap, Maartje Ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco Maru, Maria Jose Ramirez\-Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast, Matthew L Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael Andrew Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma T, Nanyun Peng, Nathan Andrew Chi, Nayeon Lee, Neta Gur\-Ari Krakover, Nicholas Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S\. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter W Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg, Richard Barnes, Rif A\. 
Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan Le Bras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Russ Salakhutdinov, Ryan Andrew Chi, Seungjae Ryan Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M\. Mohammad, Sajant Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R\. Bowman, Samuel Stern Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A\. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima Shammie Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini, Soo\-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen Prasad, Steven Piantadosi, Stuart Shieber, Summer Misherghi, Svetlana Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsunori Hashimoto, Te\-Lin Wu, Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Venkatesh Ramasesh, vinay uday prabhu, Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Sophie Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J\. Wang, Zirui Wang, and Ziyi Wu\. 
2023\.Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models\.*Transactions on Machine Learning Research*\.
- Varian and Melitz \(2024\)Hal R\. Varian and Marc Melitz\. 2024\.*Intermediate Microeconomics: A Modern Approach*, 10 edition\.W\. W\. Norton & Company, New York, NY, USA\.
- Wei et al\. \(2022\)Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H\. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus\. 2022\.Emergent abilities of large language models\.*Transactions on Machine Learning Research*\.

## Appendix A Derivation of Inverse-Demand Linking Functions

Here we derive the form of the inverse demand linking functions by evaluating the integral from[Subsection 2\.1](https://arxiv.org/html/2605.16430#S2.SS1)\.

$$\begin{aligned}
f_{0}(q)&=\int_{1}^{q}\frac{1}{q_{*}}\,\mathrm{d}q_{*}=\ln(q)\\
f_{\gamma}(q)&=\int_{1}^{q}\frac{1}{q_{*}^{1+\gamma}}\,\mathrm{d}q_{*}=-\frac{q_{*}^{-\gamma}}{\gamma}\bigg]^{q}_{1}=\frac{1}{\gamma}\left(1-q^{-\gamma}\right)
\end{aligned}$$
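As a sanity check on these closed forms, one can compare them against a direct numerical integration. The Python sketch below uses a midpoint rule; the function names, the test point $q=10$, and the step count are arbitrary illustrative choices.

```python
import math


def f_gamma(q: float, gamma: float) -> float:
    """Closed-form linking function; gamma == 0 is the logarithmic case."""
    if gamma == 0.0:
        return math.log(q)
    return (1.0 - q ** (-gamma)) / gamma


def f_gamma_numeric(q: float, gamma: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of x**-(1 + gamma) over [1, q]."""
    h = (q - 1.0) / steps
    return h * sum((1.0 + (i + 0.5) * h) ** (-(1.0 + gamma)) for i in range(steps))


for gamma in (-1.0, -0.5, 0.0, 0.5, 2.0):
    closed = f_gamma(10.0, gamma)
    numeric = f_gamma_numeric(10.0, gamma)
    print(f"gamma={gamma:+.1f}  closed={closed:.6f}  numeric={numeric:.6f}")
```

Two special cases are worth noting: at $\gamma=-1$ the closed form reduces to the linear function $q-1$, and as $\gamma\to 0$ it converges to $\ln q$, so $f_{0}$ is the continuous limit of the $f_{\gamma}$ family.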

## Appendix B Chinchilla Elasticity of Substitution

Here we derive the elasticity of substitution of $\sigma\approx.76$ for [Hoffmann et al.](https://arxiv.org/html/2605.16430#bib.bib8)’s ([2022](https://arxiv.org/html/2605.16430#bib.bib8)) Chinchilla scaling law.

### B.1 Chinchilla Scaling Law

According to Hoffmann et al. ([2022](https://arxiv.org/html/2605.16430#bib.bib8)), the next-token prediction loss $\ell$ of an LLM is predicted by

$$\ell(n,d)=\ell_{*}+\frac{a}{n^{\alpha}}+\frac{b}{d^{\beta}}$$

where the irreducible loss $\ell_{*}$ is the best possible loss that can be achieved by an LLM. Since LLMs with lower loss are generally considered to be of higher quality, we assume that the quality of an LLM is inversely proportional to the amount by which its loss exceeds $\ell_{*}$. Therefore, the Chinchilla scaling law is given by

$$q(n,d)=\frac{1}{\ell(n,d)-\ell_{*}}=\frac{1}{\frac{a}{n^{\alpha}}+\frac{b}{d^{\beta}}}.$$
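A short Python sketch makes the complementarity of $n$ and $d$ under this quality function concrete. The exponents are the ones reported by Hoffmann et al. and quoted later in this appendix; the coefficients `A_COEF`, `B_COEF`, and the irreducible loss `LOSS_FLOOR` are placeholder values chosen for illustration, not fitted constants.

```python
ALPHA, BETA = 0.3392, 0.2849   # exponents reported by Hoffmann et al. (2022)
A_COEF, B_COEF = 1.0, 1.0      # hypothetical coefficients a and b
LOSS_FLOOR = 1.7               # hypothetical irreducible loss


def chinchilla_loss(n, d, a=A_COEF, b=B_COEF, alpha=ALPHA, beta=BETA, floor=LOSS_FLOOR):
    """Predicted loss: irreducible loss plus parameter and data terms."""
    return floor + a / n**alpha + b / d**beta


def chinchilla_quality(n, d, a=A_COEF, b=B_COEF, alpha=ALPHA, beta=BETA):
    """Quality defined as the inverse of the reducible loss."""
    return 1.0 / (a / n**alpha + b / d**beta)


d_fixed = 2e10
for n in (1e8, 1e9, 1e10, 1e12, 1e30):
    print(f"n={n:.0e}  quality={chinchilla_quality(n, d_fixed):.2f}")

# With d held fixed, quality can never exceed d**beta / b, however large n gets.
print("cap from data alone:", d_fixed**BETA / B_COEF)
```

Scaling $n$ alone pushes quality toward the cap $d^{\beta}/b$ and no further, which is the sense in which both inputs must grow together.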

### B.2 Elasticity of Substitution

When $q(n,d)$ is differentiable, the elasticity of substitution of $q$ is given by the formula

$$\sigma=\frac{\mathrm{d}\ln(d/n)}{\mathrm{d}\ln(\operatorname{MRTS})}$$

where

$$\operatorname{MRTS}=\frac{\partial q(n,d)/\partial n}{\partial q(n,d)/\partial d}$$

is $q$’s marginal rate of technical substitution (MRTS).

The differential $\mathrm{d}\ln(d/n)$ is calculated using implicit differentiation along the curve where $q(n,d)$ is constant:

$$0=\mathrm{d}\left(\frac{a}{n^{\alpha}}+\frac{b}{d^{\beta}}\right)=-a\alpha n^{-\alpha-1}\mathrm{d}n-b\beta d^{-\beta-1}\mathrm{d}d\implies\mathrm{d}d=-\frac{a\alpha d^{\beta+1}}{b\beta n^{\alpha+1}}\mathrm{d}n,$$

hence

$$\mathrm{d}\ln\left(\frac{d}{n}\right)=\mathrm{d}(\ln(d)-\ln(n))=\frac{\mathrm{d}d}{d}-\frac{\mathrm{d}n}{n}=\left(-\frac{a\alpha d^{\beta}}{b\beta n^{\alpha+1}}-\frac{1}{n}\right)\mathrm{d}n.$$

The MRTS is calculated as follows:

$$\begin{aligned}
\frac{\partial q(n,d)}{\partial n}&=\frac{a\alpha}{n^{1+\alpha}(an^{-\alpha}+bd^{-\beta})^{2}}\\
\frac{\partial q(n,d)}{\partial d}&=\frac{b\beta}{d^{1+\beta}(an^{-\alpha}+bd^{-\beta})^{2}}\\
\operatorname{MRTS}&=\frac{a\alpha d^{1+\beta}}{b\beta n^{1+\alpha}},
\end{aligned}$$

hence

$$\begin{aligned}
\mathrm{d}\ln(\operatorname{MRTS})&=\mathrm{d}\ln(a\alpha d^{1+\beta})-\mathrm{d}\ln(b\beta n^{1+\alpha})\\
&=\frac{\beta+1}{d}\mathrm{d}d-\frac{\alpha+1}{n}\mathrm{d}n\\
&=-\left(\frac{a\alpha(\beta+1)d^{\beta}}{b\beta n^{\alpha+1}}+\frac{\alpha+1}{n}\right)\mathrm{d}n,
\end{aligned}$$

and therefore

$$\sigma=\frac{\mathrm{d}\ln(d/n)}{\mathrm{d}\ln(\operatorname{MRTS})}=\frac{\left(-\frac{a\alpha d^{\beta}}{b\beta n^{\alpha+1}}-\frac{1}{n}\right)\mathrm{d}n}{-\left(\frac{a\alpha(\beta+1)d^{\beta}}{b\beta n^{\alpha+1}}+\frac{\alpha+1}{n}\right)\mathrm{d}n}=\frac{a\alpha d^{\beta}+b\beta n^{\alpha}}{a\alpha(\beta+1)d^{\beta}+(\alpha+1)b\beta n^{\alpha}}.$$

When $\alpha=\beta$, the above simplifies to

$$\sigma=\frac{1}{1+\alpha}.$$

Hoffmann et al. ([2022](https://arxiv.org/html/2605.16430#bib.bib8)) report $\alpha=.3392$ and $\beta=.2849$; taking the average of these values, $\alpha=.31205$, yields $\sigma\approx.7622$.
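The closed-form elasticity can be cross-checked numerically by stepping along an isoquant and differencing $\ln(d/n)$ against $\ln(\operatorname{MRTS})$. The Python sketch below does this with a central difference; the function names, the evaluation point, and the coefficients $a=b=1$ are hypothetical choices made only for the check.

```python
import math


def sigma_closed_form(n, d, a, b, alpha, beta):
    """Elasticity of substitution from the closed-form expression."""
    num = a * alpha * d**beta + b * beta * n**alpha
    den = a * alpha * (beta + 1) * d**beta + (alpha + 1) * b * beta * n**alpha
    return num / den


def sigma_finite_difference(n, d, a, b, alpha, beta, rel_step=1e-6):
    """Estimate sigma by a central difference along the isoquant through (n, d)."""
    level = a / n**alpha + b / d**beta  # reducible loss held fixed on the isoquant

    def d_on_isoquant(n_):
        return (b / (level - a / n_**alpha)) ** (1.0 / beta)

    def mrts(n_, d_):
        return (a * alpha * d_ ** (1 + beta)) / (b * beta * n_ ** (1 + alpha))

    n_lo, n_hi = n * (1 - rel_step), n * (1 + rel_step)
    d_lo, d_hi = d_on_isoquant(n_lo), d_on_isoquant(n_hi)
    d_log_ratio = math.log(d_hi / n_hi) - math.log(d_lo / n_lo)
    d_log_mrts = math.log(mrts(n_hi, d_hi)) - math.log(mrts(n_lo, d_lo))
    return d_log_ratio / d_log_mrts


# Reported exponents, hypothetical coefficients and evaluation point.
args = dict(n=1e9, d=2e10, a=1.0, b=1.0, alpha=0.3392, beta=0.2849)
print("closed form:      ", sigma_closed_form(**args))
print("finite difference:", sigma_finite_difference(**args))

# With alpha == beta the expression collapses to 1 / (1 + alpha).
print("averaged alpha:   ", round(1.0 / (1.0 + 0.31205), 4))
```

When $\alpha\neq\beta$, the elasticity depends on the point $(n,d)$ at which it is evaluated, which is why the single figure $\sigma\approx.7622$ relies on averaging the two exponents.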

## Appendix C Proof of Log-Quasilinear Demand Result

Throughout this section, we assume that $n^{*}$ is determined by the first-order condition $\pi^{\prime}(n^{*})=0$. To simplify notation, let $F\triangleq(f\circ q)(n)$ and $F^{\prime}\triangleq(f\circ q)^{\prime}(n)$, where $q(n)=an^{\alpha}$.

The derivative of the profit function is

$$\pi^{\prime}(n)=\frac{\omega^{2}}{2\delta}FF^{\prime}-\frac{\omega}{\delta E}(F+F^{\prime}n)+\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right)n.$$

See [Theorem 2](https://arxiv.org/html/2605.16430#Thmtheorem2).

###### Proof\.

Since $\gamma=0$, we have $F=\alpha\ln n+\ln a$ and $F^{\prime}=\alpha/n$. We have the first-order conditions

$$\begin{aligned}
0=\pi^{\prime}(n)&=\frac{\alpha^{2}\omega^{2}}{2\delta}\frac{\ln n+\ln a}{n}-\frac{\alpha\omega}{\delta E}(\ln n+\ln a+1)+\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right)n\\
\iff 0&=\frac{\alpha^{2}\omega^{2}}{2\delta}\frac{\ln n+\ln a}{n^{2}}-\frac{\alpha\omega}{\delta E}\frac{\ln n+\ln a+1}{n}+\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right).
\end{aligned}$$

For $x>0$, $\frac{\ln x+\ln a}{x}$ is maximized at $x=\frac{e}{a}$, where it takes the value $\frac{a}{e}$. Thus, we bound $\frac{\ln n+\ln a}{n}\leq\frac{a}{e}$ in the first term:

$$0\leq\frac{\alpha^{2}\omega^{2}a}{2\delta en}-\frac{\alpha\omega}{\delta E}\frac{\ln n+\ln a+1}{n}+\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right).$$

Next, we can apply $0\leq\ln n$ in the second (negative) term to get

$$\begin{aligned}
0&\leq\frac{\alpha^{2}\omega^{2}a}{2\delta en}-\frac{\alpha\omega(\ln a+1)}{\delta En}+\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right)\\
&=\left(\frac{\alpha^{2}\omega^{2}a}{2\delta e}-\frac{\alpha\omega(\ln a+1)}{\delta E}\right)\frac{1}{n}+\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right)\\
\implies-\frac{2}{E}\left(\frac{1}{\delta E}-6\rho\right)n&\leq\frac{\alpha^{2}\omega^{2}a}{2\delta e}-\frac{\alpha\omega(\ln a+1)}{\delta E}\\
\left(\frac{12\rho}{E}-\frac{2}{\delta E^{2}}\right)n&\leq\frac{\alpha^{2}\omega^{2}a}{2\delta e}-\frac{\alpha\omega(\ln a+1)}{\delta E}\\
\left(6\delta\rho-\frac{1}{E}\right)n&\leq\frac{\alpha^{2}\omega^{2}aE}{4e}-\frac{\alpha\omega}{2}(\ln a+1).
\end{aligned}$$

Since $E>1/(6\delta\rho)$ by assumption, we can divide both sides to get

$$n^{*}\leq\frac{\alpha^{2}\omega^{2}}{4e\left(6\delta\rho-\frac{1}{E}\right)}aE-\frac{\alpha\omega}{2\left(6\delta\rho-\frac{1}{E}\right)}(\ln a+1).$$

Absorbing constants, we have the following asymptotics for large $E$:

$$n^{*}=O\left(\frac{aE}{\rho}\right).$$

By [Lemma 2](https://arxiv.org/html/2605.16430#Thmlemma2) and since $\alpha=\beta$, we can characterize $d^{*}$ as

$$d^{*}=\rho n^{*}=O\left(aE\right).$$

Putting it all together, the optimal training compute allocation scales comparably to data:

$$C^{*}_{\operatorname{train}}=\frac{6n^{*}d^{*}}{E}=O\left(\frac{a^{2}E}{\rho}\right).\qquad\blacksquare$$
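The explicit bound on $n^{*}$ obtained just before the asymptotic step can be evaluated numerically to see the linear-in-$E$ behavior emerge. In the Python sketch below, the parameter values for $\omega$, $\delta$, $\rho$, and $a$ are hypothetical placeholders, and the function names are not from the paper. The helper `g` also checks the claim that $(\ln x+\ln a)/x$ peaks at $x=e/a$ with value $a/e$.

```python
import math


def g(x: float, a: float) -> float:
    """The function (ln x + ln a) / x bounded in the proof."""
    return (math.log(x) + math.log(a)) / x


def n_upper_bound(E, a, alpha, omega, delta, rho):
    """Explicit upper bound on n* from the proof; needs E > 1 / (6 * delta * rho)."""
    margin = 6.0 * delta * rho - 1.0 / E
    if margin <= 0:
        raise ValueError("requires E > 1 / (6 * delta * rho)")
    leading = alpha**2 * omega**2 * a * E / (4.0 * math.e * margin)
    correction = alpha * omega * (math.log(a) + 1.0) / (2.0 * margin)
    return leading - correction


# Hypothetical parameter values, for illustration only.
params = dict(a=2.0, alpha=0.3, omega=1.0, delta=1.0, rho=1.0)
limit = params["alpha"] ** 2 * params["omega"] ** 2 / (24.0 * math.e * params["delta"])
for E in (1e3, 1e6, 1e9, 1e12):
    ratio = n_upper_bound(E, **params) / (params["a"] * E / params["rho"])
    print(f"E={E:.0e}  bound / (a E / rho) = {ratio:.8f}")
print(f"large-E limit alpha^2 omega^2 / (24 e delta) = {limit:.8f}")
```

As $E$ grows, the ratio of the bound to $aE/\rho$ settles at the constant $\alpha^{2}\omega^{2}/(24e\delta)$, which is the constant absorbed into the $O(\cdot)$ statement.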

## Appendix D Proof of Poly-Quasilinear Demand Result

See [Theorem 1](https://arxiv.org/html/2605.16430#Thmtheorem1).

###### Proof\.

The derivative of the profit function is given by

$$
\pi'(n) = \frac{\alpha\omega^{2}n^{-\alpha\gamma-1}}{2a^{\gamma}\gamma\delta} - \frac{\alpha\omega^{2}n^{-2\alpha\gamma-1}}{2a^{2\gamma}\gamma\delta} + \left(\frac{1}{\gamma} - \alpha\right)\frac{\omega n^{-\alpha\gamma}}{a^{\gamma}E\delta} + \left(\frac{1}{E\delta} - 6\rho\right)\frac{2n}{E} - \frac{\omega}{E\gamma\delta}.
$$

From this, we derive

$$
0 = \underbrace{\frac{\alpha\omega {n^{*}}^{-\alpha\gamma-2}}{2a^{\gamma}\delta}}_{T_{1}} \underbrace{\mathrel{-} \frac{\alpha\omega {n^{*}}^{-2\alpha\gamma-2}}{2a^{2\gamma}\delta}}_{T_{2}} + \underbrace{\left(1-\alpha\gamma\right)\frac{{n^{*}}^{-\alpha\gamma-1}}{a^{\gamma}E\delta}}_{T_{3}} + \underbrace{\left(\frac{1}{E\delta} - 6\rho\right)\frac{2\gamma}{\omega E}}_{T_{4}} \underbrace{\mathrel{-} \frac{{n^{*}}^{-1}}{E\delta}}_{T_{5}} \tag{1}
$$

by multiplying both sides of the first-order condition $\pi'(n^{*}) = 0$ by $\gamma/(n^{*}\omega)$.
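The rescaling step can be sanity-checked numerically. The sketch below evaluates $\pi'(n)$ term by term, multiplies it by $\gamma/(n\omega)$, and compares the result with $T_{1} + \dots + T_{5}$ at an arbitrary point. The function names and parameter values are hypothetical, and a numerical check at one point is only a plausibility test, not a proof.

```python
import math


def profit_derivative(n, a, E, alpha, gamma, omega, delta, rho):
    """pi'(n), written term by term as in the displayed derivative."""
    return (
        alpha * omega**2 * n**(-alpha * gamma - 1) / (2 * a**gamma * gamma * delta)
        - alpha * omega**2 * n**(-2 * alpha * gamma - 1) / (2 * a**(2 * gamma) * gamma * delta)
        + (1 / gamma - alpha) * omega * n**(-alpha * gamma) / (a**gamma * E * delta)
        + (1 / (E * delta) - 6 * rho) * 2 * n / E
        - omega / (E * gamma * delta)
    )


def rescaled_terms(n, a, E, alpha, gamma, omega, delta, rho):
    """The five terms T1..T5 of Equation (1), evaluated at n."""
    t1 = alpha * omega * n**(-alpha * gamma - 2) / (2 * a**gamma * delta)
    t2 = -alpha * omega * n**(-2 * alpha * gamma - 2) / (2 * a**(2 * gamma) * delta)
    t3 = (1 - alpha * gamma) * n**(-alpha * gamma - 1) / (a**gamma * E * delta)
    t4 = (1 / (E * delta) - 6 * rho) * 2 * gamma / (omega * E)
    t5 = -1 / (n * E * delta)
    return t1, t2, t3, t4, t5


def rescaling_matches(n, **params):
    """True if pi'(n) * gamma / (n * omega) equals T1 + ... + T5 up to rounding."""
    lhs = profit_derivative(n, **params) * params["gamma"] / (n * params["omega"])
    rhs = sum(rescaled_terms(n, **params))
    return math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-15)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    params = dict(a=2.0, E=1e6, alpha=0.3, gamma=1.5, omega=1.0, delta=0.5, rho=20.0)
    print(rescaling_matches(1000.0, **params))
```

Because the identity holds for every $n > 0$ and not only at the optimum, any positive test point can be used.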

To obtain the upper bound, we observe that $T_{2}, T_{5} < 0$. Thus, subtracting $T_{2} + T_{5}$ from the right-hand side of Equation ([1](https://arxiv.org/html/2605.16430#A4.E1)) gives us:

$$
\begin{aligned}
0 &\leq \frac{\alpha\omega {n^{*}}^{-\alpha\gamma-2}}{2a^{\gamma}\delta} + \left(1-\alpha\gamma\right)\frac{{n^{*}}^{-\alpha\gamma-1}}{a^{\gamma}E\delta} + \left(\frac{1}{E\delta} - 6\rho\right)\frac{2\gamma}{\omega E} \\
&\leq \frac{\alpha\omega {n^{*}}^{-\alpha\gamma-1}}{2a^{\gamma}\delta} + \left(1-\alpha\gamma\right)\frac{{n^{*}}^{-\alpha\gamma-1}}{a^{\gamma}E\delta} + \left(\frac{1}{E\delta} - 6\rho\right)\frac{2\gamma}{\omega E} \\
&= \left(\frac{\alpha\omega}{2a^{\gamma}\delta} + \frac{1-\alpha\gamma}{a^{\gamma}E\delta}\right){n^{*}}^{-(\alpha\gamma+1)} + \left(\frac{1}{E\delta} - 6\rho\right)\frac{2\gamma}{\omega E}.
\end{aligned}
$$

Solving for $n^{*}$, we have

$$
\begin{aligned}
-\left(\frac{1}{E\delta} - 6\rho\right)\frac{2\gamma}{\omega E}\left(\frac{\alpha\omega}{2a^{\gamma}\delta} + \frac{1-\alpha\gamma}{a^{\gamma}E\delta}\right)^{-1} &\leq {n^{*}}^{-(\alpha\gamma+1)} \\
\implies {n^{*}}^{\alpha\gamma+1} &\leq \frac{\omega E}{2\gamma\delta a^{\gamma}}\left(6\rho - \frac{1}{E\delta}\right)^{-1}\left(\frac{\alpha\omega}{2} + \frac{1-\alpha\gamma}{E}\right) \\
&= \frac{\omega E}{2\gamma\delta a^{\gamma}} \cdot \frac{\alpha\omega E + 2 - 2\alpha\gamma}{2E\left(6\rho - \frac{1}{E\delta}\right)} = O\left(\frac{E}{\rho a^{\gamma}}\right).
\end{aligned}
$$

Therefore we obtain

$$
n^{*} = O\left(\left(\frac{E}{\rho a^{\gamma}}\right)^{1/(\alpha\gamma+1)}\right) \qed
$$

We can use [Lemma 2](https://arxiv.org/html/2605.16430#Thmlemma2) to bound $d^{*}$ as

$$
d^{*} = \rho n^{*} = O\left(\left(\frac{\rho^{\alpha\gamma}E}{a^{\gamma}}\right)^{1/(\alpha\gamma+1)}\right).
$$

Putting it all together, we get

$$
C^{*}_{\operatorname{train}} = \frac{6n^{*}d^{*}}{E} = O\left(\left(\frac{E}{a^{\gamma}}\right)^{1/(\alpha\gamma+1)}\right).
$$
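To make the intermediate bound concrete, the sketch below evaluates the explicit upper bound on ${n^{*}}^{\alpha\gamma+1}$ in both of the forms shown in the derivation, takes the $1/(\alpha\gamma+1)$ root to bound $n^{*}$, and applies $d^{*} = \rho n^{*}$. The function name and parameter values are hypothetical, and the code only evaluates the stated expressions; it does not solve the profit maximization problem.

```python
def poly_quasilinear_bounds(a, E, alpha, gamma, omega, delta, rho):
    """Evaluate the explicit bound on n*^(alpha*gamma + 1) and the implied bounds.

    Returns (product_form, combined_form, n_bound, d_bound), where the first
    two entries are the two equivalent ways the bound is written in the proof.
    All arguments are illustrative placeholders.
    """
    gap = 6 * rho - 1 / (E * delta)
    if gap <= 0:
        # The derivation inverts (6*rho - 1/(E*delta)), so it must be positive.
        raise ValueError("the bound requires 6 * rho > 1 / (E * delta)")
    prefactor = omega * E / (2 * gamma * delta * a**gamma)
    product_form = prefactor * (alpha * omega / 2 + (1 - alpha * gamma) / E) / gap
    combined_form = prefactor * (alpha * omega * E + 2 - 2 * alpha * gamma) / (2 * E * gap)
    n_bound = product_form ** (1 / (alpha * gamma + 1))
    d_bound = rho * n_bound  # Lemma 2 relation d* = rho * n*
    return product_form, combined_form, n_bound, d_bound


if __name__ == "__main__":
    # Hypothetical parameters; compare the bounds at E and 2E.
    for E in (1e6, 2e6):
        print(E, poly_quasilinear_bounds(a=2.0, E=E, alpha=0.3, gamma=1.5,
                                         omega=1.0, delta=0.5, rho=20.0))
```

Under these placeholder values, doubling $E$ roughly doubles the bound on ${n^{*}}^{\alpha\gamma+1}$, so the bound on $n^{*}$ itself grows by a factor of about $2^{1/(\alpha\gamma+1)}$.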