Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

arXiv cs.LG 06/08/26, 04:00 AM Papers
scaling-laws regularization data-constrained language-model pretraining masked-input autoregressive
Summary
This paper studies data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and downstream performance, and SoftQ, a scaling law that better captures model-data interaction under repeated data.
arXiv:2606.06888v1 Announce Type: new Abstract: Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.
# Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws
Source: [https://arxiv.org/html/2606.06888](https://arxiv.org/html/2606.06888)
Zhiwei Xu¹, Shihao Wu¹, Hanseul Cho², Wei Hu¹, Yixin Wang¹∗
¹University of Michigan, ²KAIST AI
{zhiweixu,wshihao,vvh,yixinw}@umich.edu, jhs4015@kaist.ac.kr

###### Abstract

Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at [https://github.com/yixinw-lab/dc_pretrain](https://github.com/yixinw-lab/dc_pretrain).

## 1 Introduction

Scaling laws (Kaplan et al., [2020](https://arxiv.org/html/2606.06888#bib.bib30); Hoffmann et al., [2022](https://arxiv.org/html/2606.06888#bib.bib29)) are widely used to choose model size and training-token budget for large language model pretraining. Classical scaling laws are largely compute-centric: they study how to allocate a fixed compute budget between parameters and tokens, assuming that unique training data can scale freely with compute. In this abundant-data setting, pretraining is typically performed with a single pass over a large corpus.

However, training compute is growing faster than the supply of natural language data (Villalobos et al., [2024](https://arxiv.org/html/2606.06888#bib.bib13); Sevilla and Roldán, [2024](https://arxiv.org/html/2606.06888#bib.bib14); Common Crawl, [2025](https://arxiv.org/html/2606.06888#bib.bib15)), making data-constrained, compute-rich pretraining increasingly important. In this regime, the unique dataset is fixed, and additional compute is spent on larger models and multiple passes over the same corpus. Prior work has begun to study this setting: Muennighoff et al. ([2023](https://arxiv.org/html/2606.06888#bib.bib23)) tuned data repetition while fixing weight decay to 0.1 and proposed scaling laws based on effective resources that saturate with repetitions and excess parameters; Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)) further showed that large weight decay is critical for preventing overfitting.

This shift raises two linked questions. The first concerns regularization: how can models avoid overfitting when compute increases but unique data does not? Prior work points to strong weight decay as one answer. A second possibility comes from masked diffusion language models (dLLMs), which typically use the same transformer architecture as autoregressive (AR) models but train by predicting randomly masked tokens. Under identical hyperparameters, dLLMs achieve lower validation loss than AR transformers in the data-constrained regime (Ni et al., [2025](https://arxiv.org/html/2606.06888#bib.bib35); Prabhudesai et al., [2025](https://arxiv.org/html/2606.06888#bib.bib32)), suggesting that random masking may itself act as a form of regularization. However, these comparisons do not isolate masking from regularization strength: the dLLM advantage may be complementary to strong weight decay, or it may largely reflect insufficiently strong regularization in the AR baseline. This motivates our first question: how do random masking and weight decay interact, and how much does each contribute on top of the other?

![Refer to caption](https://arxiv.org/html/2606.06888v1/x1.png) (a) MIR improves a strong AR baseline.
![Refer to caption](https://arxiv.org/html/2606.06888v1/x2.png) (b) SoftQ captures data–model coupling.

Figure 1: Overview of the main results. Left: On the DataComp-LM (DCLM) dataset (Li et al., [2024](https://arxiv.org/html/2606.06888#bib.bib18)) with $100\mathrm{M}$ unique training tokens, MIR improves validation loss over the strongly regularized autoregressive baseline across model sizes. Points show means over five random seeds, error bars show one standard deviation, and faint markers show individual runs. Right: On the strongly regularized baseline grid, we plot the loss gap $L(N,U)-L(N,400\mathrm{M})$ for unique data budget $U\in\{100\mathrm{M},200\mathrm{M},300\mathrm{M}\}$. Chinchilla predicts a model-size-invariant gap for each $U$, while SoftQ tracks the empirical fan-out: the penalty from limited unique data grows with model size.

The second question concerns scaling: what loss law describes the data-constrained, compute-rich regime? Chinchilla-style laws were fit to single-pass, abundant-data training and may not capture the validation-loss surface when unique data, rather than compute, is the binding resource. In particular, their additive form predicts that the loss gap between two unique-data budgets should be independent of model size. In this paper, we study both questions in the data-constrained, compute-rich regime.

Finding 1: Random masking provides regularization complementary to strong weight decay. We first ask how the two regularization mechanisms interact. We find that strong weight decay is not specific to AR pretraining: applying the AR-tuned weight decay to dLLMs substantially lowers their validation loss, and once both models are strongly regularized, their validation losses become comparable across the model sizes we study. Because strong weight decay alone provides such substantial regularization, it is unclear whether random masking can still provide additional benefit once strong weight decay is already in use.

To isolate this effect, we study *masked-input regularization* (MIR), a minimal modification to standard AR pretraining. Let $x$ denote a clean sequence and $\widetilde{x}$ a randomly masked version of the same sequence. Instead of optimizing only the standard next-token prediction loss $\mathcal{L}_{\mathrm{NTP}}(x)$, MIR optimizes

$$\mathcal{L}=\mathcal{L}_{\mathrm{NTP}}(x)+\lambda\mathcal{L}_{\mathrm{NTP}}(\widetilde{x}).$$

Thus, the model trains on both clean and masked inputs, using the masked-input loss as an auxiliary regularizer. MIR requires no architectural changes and preserves standard autoregressive decoding at inference. Although it increases training compute, our setting is data-constrained and compute-rich, so we study MIR as a way to improve loss at a fixed unique-data budget, i.e., data efficiency rather than compute efficiency.

Across models from 72M to 1.4B parameters trained on DCLM (Li et al., [2024](https://arxiv.org/html/2606.06888#bib.bib18)) and Stack-V2 (Lozhkov et al., [2024](https://arxiv.org/html/2606.06888#bib.bib19)), MIR consistently improves validation loss on top of strong weight decay (Figure [1(a)](https://arxiv.org/html/2606.06888#S1.F1.sf1)). At 1.4B parameters, it also yields substantial downstream gains, including +10.2 points on BoolQ and +2.2 points on SciQ.

Finding 2: Chinchilla is misspecified in the data-constrained, compute-rich regime; a coupled scaling law fits better. To quantify how much unique data MIR is worth, we extend our experiments across five model sizes and four unique-data budgets and fit several scaling laws. The additive Chinchilla form (Hoffmann et al., [2022](https://arxiv.org/html/2606.06888#bib.bib29)) fits poorly in this regime: it predicts that the validation-loss gap between two data budgets is independent of model size, whereas our experiments show that this gap grows with model size (Figure [1(b)](https://arxiv.org/html/2606.06888#S1.F1.sf2)).

We propose the *SoftQ scaling law*, a five-parameter form that couples model size and data size through a soft bottleneck motivated by the skill-learning view of scaling laws (Michaud, [2026](https://arxiv.org/html/2606.06888#bib.bib16)). SoftQ achieves better in-sample fit and out-of-sample prediction than Chinchilla, Quanta (Michaud, [2026](https://arxiv.org/html/2606.06888#bib.bib16)), and Muennighoff-style (Muennighoff et al., [2023](https://arxiv.org/html/2606.06888#bib.bib23)) laws on our dataset. The same ranking holds on an independent dataset from Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)). Using SoftQ as the baseline scaling law, we estimate MIR's gain over the strongly regularized baseline to be equivalent to roughly $1.3\times$ as much unique training data at the 200M–400M token budgets.

Contributions. We summarize our contributions as follows: (i) We show that large weight decay substantially improves dLLMs in the data-constrained regime, and that random masking further improves strongly regularized AR models. Building on this observation, we propose MIR, a minimal recipe that augments strongly regularized AR pretraining with an auxiliary masked-input next-token loss; we estimate MIR to be worth roughly $1.3\times$ as much unique training data at the 200M to 400M token budgets. (ii) We show that additive Chinchilla-style scaling laws do not fit the data-constrained, compute-rich regime, and propose SoftQ, a five-parameter scaling law that couples model and data size and substantially outperforms these alternatives.

## 2 Setup: Data-Constrained Autoregressive and Masked Pretraining

### 2.1 Data-Constrained and Compute-Rich Pretraining

Let $N$ denote the number of model parameters, $U$ the number of unique pretraining tokens, $N_E$ the number of epochs over those tokens, and $D=UN_E$ the total number of training tokens. For a standard dense decoder-only transformer trained with next-token prediction, the training compute is approximately $C(N,D)\approx 6ND$.
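As a quick arithmetic check of these definitions, the sketch below evaluates $D=UN_E$ and $C\approx 6ND$ for one setting. The function names and the epoch count are illustrative assumptions; only the two formulas come from the text.

```python
def total_training_tokens(unique_tokens: float, epochs: float) -> float:
    """D = U * N_E: tokens processed across all passes over the corpus."""
    return unique_tokens * epochs


def approx_training_flops(params: float, tokens: float) -> float:
    """C(N, D) ~= 6 * N * D for a dense decoder-only transformer."""
    return 6.0 * params * tokens


# Illustrative inputs. The epoch count is hypothetical, not a tuned value from the paper.
N = 1.4e9   # model parameters
U = 100e6   # unique pretraining tokens
N_E = 8     # epochs over the unique tokens (assumed)

D = total_training_tokens(U, N_E)
C = approx_training_flops(N, D)
print(f"D = {D:.2e} tokens, C = {C:.2e} FLOPs")
```

With $U$ fixed, compute grows only through $N$ and $N_E$, which is the sense in which this regime is compute-rich but data-constrained.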

Classical compute-optimal scaling laws (Kaplan et al., [2020](https://arxiv.org/html/2606.06888#bib.bib30); Hoffmann et al., [2022](https://arxiv.org/html/2606.06888#bib.bib29)) model evaluation loss as a function of model size and training-token budget. In the abundant-data regime, the processed tokens can be treated as fresh samples, so the distinction between unique tokens and repeated tokens is not explicit. The standard compute-allocation problem is

$$(N^{\star}(C),D^{\star}(C))=\arg\min_{N,D}L_{\mathrm{eval}}(N,D)\quad\mathrm{s.t.}\quad C(N,D)=C.$$

For example, Chinchilla-style parametric scaling writes $\widehat{L}(N,D)=E+AN^{-\alpha}+BD^{-\beta}$ and then chooses the point on this surface that minimizes loss under the training-compute constraint. Such laws are highly effective when new data is available, but they do not distinguish a token budget $D$ consisting of fresh tokens from the same budget obtained by repeatedly training on a finite corpus.
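The additive form has a simple consequence that can be checked numerically: the loss gap between two data budgets does not depend on model size, because the $AN^{-\alpha}$ term cancels. The sketch below illustrates this with placeholder coefficients, which are assumptions chosen only for the demonstration and are not fitted values from any paper.

```python
def chinchilla_loss(N, D, E, A, B, alpha, beta):
    """Additive parametric form: E + A * N^(-alpha) + B * D^(-beta)."""
    return E + A * N ** (-alpha) + B * D ** (-beta)


# Placeholder coefficients for illustration only; these are not fitted values.
COEF = dict(E=2.0, A=300.0, B=300.0, alpha=0.3, beta=0.3)

MODEL_SIZES = (72e6, 140e6, 257e6, 664e6, 1.4e9)


def data_gap(N, D_small, D_large):
    """Loss gap between two data budgets at a fixed model size."""
    return chinchilla_loss(N, D_small, **COEF) - chinchilla_loss(N, D_large, **COEF)


gaps = [data_gap(N, 100e6, 400e6) for N in MODEL_SIZES]
for N, g in zip(MODEL_SIZES, gaps):
    print(f"N = {N:.2e}: gap = {g:.6f}")
```

Every printed gap agrees up to floating-point rounding, whatever coefficients are used. A law whose gap changes with $N$ needs a term that couples model size and data size.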

In data-constrained, compute-rich pretraining, the unique-token budget $U$ is fixed or bounded, and $C$ is unbounded. Additional training compute can be spent by increasing the number of epochs, increasing model size, or changing regularization. Prior work studies several versions of this problem. Muennighoff et al. ([2023](https://arxiv.org/html/2606.06888#bib.bib23)) model repeated data under compute constraints by replacing raw token and parameter counts with effective resources that saturate as repetitions and excess parameters grow. Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)) study a more compute-rich setting in which the unique data is fixed and the training recipe is tuned to estimate the best attainable loss at each model scale.

We follow the compute\-rich perspective\. For a fixed architecture family, optimizer class, data distribution, and evaluation protocol, define the optimized validation\-loss envelope

$$L^{\star}(N,U)=\inf_{h\in\mathcal{H}}L_{\mathrm{eval}}(N,U;h),$$

where $h$ includes the tunable training hyperparameters, such as the number of epochs, learning-rate schedule, weight decay, and other regularization choices. In this formulation, $D=UN_E(h)$ determines the compute used by a particular training run, but compute is not the binding constraint used to define $L^{\star}$. The goal is therefore to model the joint dependence of the best-achievable loss on model size $N$ and unique data size $U$.
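In practice the infimum over $\mathcal{H}$ can only be approximated by a search over a finite set of configurations. The sketch below shows that approximation for a single $(N,U)$ cell. The loss function `toy_eval_loss` and the grid values are hypothetical stand-ins; a real study would train and evaluate a model for each configuration.

```python
from itertools import product


def toy_eval_loss(epochs, weight_decay, lr):
    """Hypothetical stand-in for L_eval(N, U; h) at one fixed (N, U) cell."""
    return (
        3.0
        + 0.01 * (epochs - 8) ** 2
        + 0.5 * (weight_decay - 0.8) ** 2
        + 1e4 * (lr - 3e-3) ** 2
    )


# Hypothetical search grid over the tunable hyperparameters h.
grid = {
    "epochs": [4, 8, 16],
    "weight_decay": [0.1, 0.4, 0.8, 1.6],
    "lr": [1e-3, 3e-3, 1e-2],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# Finite-grid approximation of the infimum: the best configuration found.
best = min(configs, key=lambda h: toy_eval_loss(**h))
envelope = toy_eval_loss(**best)
print(f"searched {len(configs)} configs, best = {best}, loss = {envelope:.3f}")
```

The minimum over a grid is an upper bound on $L^{\star}(N,U)$, so a finer or wider grid can only lower the estimate.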

### 2.2 Autoregressive and Masked Diffusion Language Models

Let $p_{\theta}$ denote the transformer model and $\{x_i\}_{i=1}^{n}$ the training dataset, where each sample $x_i=[x_{i,0},x_{i,1},\dots,x_{i,T-1}]$ is a sequence of length $T$. Autoregressive models predict tokens from left to right. The training objective $\mathcal{L}_{\mathrm{NTP}}$ is

$$-\frac{1}{nT}\sum_{i=1}^{n}\sum_{t=0}^{T-1}\log p_{\theta}(x_{i,t}\,|\,x_{i,<t}).$$

For each sequence $x_i$, dLLMs sample a mask ratio $r_i\sim\text{Unif}(0,1]$ and use a Bernoulli random variable $\text{Bern}(r_i)$ to decide whether to mask the token $x_{i,t}$ at each position $t\in[0,T)$. The model predicts the true tokens only at the masked positions. The training objective is

$$-\frac{1}{nT}\sum_{i=1}^{n}\Big[\frac{1}{r_i}\sum_{t=0}^{T-1}\mathbb{I}(\widetilde{x}_{i,t}=\text{MASK})\log p_{\theta}(x_{i,t}\,|\,\widetilde{x}_i)\Big],$$

where $\widetilde{x}_i$ denotes the masked version of the sample $x_i$.
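The masking step and the $1/r_i$ weighting can be sketched for a single sequence as follows. This is a minimal illustration under stated assumptions: the uniform toy model `uniform_log_prob`, the vocabulary size, and the sequence are hypothetical, and the function computes only the per-sequence term of the objective (the paper's objective then averages over the $n$ sequences).

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, ratio, rng):
    """Replace each token by MASK independently with probability `ratio`."""
    return [MASK if rng.random() < ratio else tok for tok in tokens]


def dllm_sequence_loss(tokens, masked, ratio, log_prob):
    """-(1 / (r * T)) * sum over masked positions of log p(x_t | masked sequence)."""
    total = sum(
        log_prob(tokens[t], masked)
        for t in range(len(tokens))
        if masked[t] == MASK
    )
    return -total / (ratio * len(tokens))


VOCAB_SIZE = 50  # hypothetical vocabulary size


def uniform_log_prob(target, context):
    """Toy model: uniform over the vocabulary, ignoring the context."""
    return -math.log(VOCAB_SIZE)


rng = random.Random(0)
sequence = [rng.randrange(VOCAB_SIZE) for _ in range(16)]
ratio = 1.0 - rng.random()  # a draw from Unif(0, 1]
masked = mask_tokens(sequence, ratio, rng)
loss = dllm_sequence_loss(sequence, masked, ratio, uniform_log_prob)
print(f"ratio = {ratio:.3f}, masked = {masked.count(MASK)}, loss = {loss:.3f}")
```

Dividing by $r_i$ compensates for the fact that only about $r_iT$ positions contribute, so the term stays on the same scale as a per-token loss regardless of the sampled ratio.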

![Refer to caption](https://arxiv.org/html/2606.06888v1/x3.png)

Figure 2: Validation loss dynamics on DCLM 100M for the 257M model. Large weight decay substantially improves both multi-epoch AR and dLLM training; with both well regularized, their validation losses become comparable.

## 3 Regularization in the Data-Constrained, Compute-Rich Regime

### 3.1 Weight Decay Transfers Across AR and dLLM Pretraining

Recent studies report that dLLMs outperform AR models in the data-constrained regime (Ni et al., [2025](https://arxiv.org/html/2606.06888#bib.bib35); Prabhudesai et al., [2025](https://arxiv.org/html/2606.06888#bib.bib32)), using weight decay $\mathrm{wd}=0.1$ for both. Independently, Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)) showed that large weight decay is critical for AR pretraining in this regime. We ask whether this benefit transfers to dLLMs and re-examine the AR–dLLM comparison under matched large-weight-decay treatment. On DCLM with 100M unique tokens, we compare four recipes at three model sizes (140M, 257M, and 664M): (i) Multi-epoch AR ($\mathrm{wd}=0.1$, tuned epochs); (ii) Multi-epoch dLLM ($\mathrm{wd}=0.1$, per Prabhudesai et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib32))); (iii) Strongly Regularized AR with epochs, learning rate, and weight decay jointly tuned following Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)); (iv) Strongly Regularized dLLM, which inherits the AR-tuned weight decay but keeps other hyperparameters at the Prabhudesai et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib32)) defaults. We report final-step validation loss for AR and the best across-epoch loss for dLLM.

Figure [2](https://arxiv.org/html/2606.06888#S2.F2) shows all four recipes at 257M. With $\mathrm{wd}=0.1$, we reproduce the finding that dLLM ($3.60$) outperforms multi-epoch AR ($3.88$). Large weight decay dramatically improves both: it reduces AR loss to $3.42$ and, when ported to dLLM, reduces dLLM loss to $3.48$. Since dLLM validation loss is the negative evidence lower bound (an upper bound on the negative log-likelihood) while AR loss is exact negative log-likelihood, the slightly higher dLLM validation loss does not imply worse performance. The two strongly regularized recipes have comparable losses at 140M, 257M, and 664M (see Table [4](https://arxiv.org/html/2606.06888#A1.T4) in Appendix [A](https://arxiv.org/html/2606.06888#A1)), implying that the previously reported AR–dLLM gap is largely explained by insufficient AR regularization. Still, the fact that dLLMs avoid the repeated-epoch collapse seen in weakly regularized AR suggests that random input masking acts as an implicit regularizer in its own right. We next ask whether this masking signal can contribute additional gains on top of strong weight decay when added to AR training.

### 3.2 Masked Input Regularization

To capture this hypothesized benefit without abandoning the efficiency of standard AR decoding, we study masked-input regularization (MIR). The method samples a mask ratio $r$ from a uniform distribution $\text{Unif}(r_{\min},r_{\max})$ for each input sequence $x$. At each position $t\in[0,T-1]$, a Bernoulli random variable with success probability $r$ determines whether to replace the token $x_t$ with a special [MASK] token. Let $\widetilde{x}$ denote this corrupted sequence. Without altering the model architecture, MIR adds an auxiliary next-token prediction loss on the masked sequence:

$$\mathcal{L}=\mathcal{L}_{\mathrm{NTP}}(x)+\lambda\mathcal{L}_{\mathrm{NTP}}(\widetilde{x}).$$

It requires two forward passes to calculate the training loss for each batch. MIR therefore increases per-step training compute. Because our focus is the data-constrained, compute-rich regime, we use MIR to study whether additional compute can improve loss at a fixed unique-data budget, rather than as a compute-efficiency method. See tuning details and regularization coefficient in Appendix [A.7](https://arxiv.org/html/2606.06888#A1.SS7).
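The MIR objective can be sketched for one sequence as two next-token losses, one on the clean input and one on the masked input. This is a minimal sketch under explicit assumptions, not the authors' implementation: it assumes the masked-input term still predicts the clean next tokens, and the values of `lam`, `r_min`, `r_max`, the vocabulary size, and the uniform toy model are hypothetical (the tuned values are in the paper's appendix).

```python
import math
import random

MASK = "[MASK]"
VOCAB_SIZE = 50  # hypothetical vocabulary size


def sample_masked(tokens, r_min, r_max, rng):
    """Draw r ~ Unif(r_min, r_max), then mask each position with probability r."""
    r = rng.uniform(r_min, r_max)
    return [MASK if rng.random() < r else tok for tok in tokens]


def ntp_loss(inputs, targets, log_prob):
    """Mean next-token NLL: position t predicts targets[t] from the prefix inputs[:t]."""
    T = len(targets)
    return -sum(log_prob(targets[t], inputs[:t]) for t in range(T)) / T


def mir_loss(tokens, log_prob, lam, r_min, r_max, rng):
    """L = L_NTP(x) + lam * L_NTP(x_masked); needs two forward passes per batch."""
    masked = sample_masked(tokens, r_min, r_max, rng)
    clean_term = ntp_loss(tokens, tokens, log_prob)
    # Assumption: the masked-input term conditions on masked prefixes but
    # still scores the clean (unmasked) target tokens.
    masked_term = ntp_loss(masked, tokens, log_prob)
    return clean_term + lam * masked_term


def uniform_log_prob(target, prefix):
    """Toy model: uniform over the vocabulary, ignoring the prefix."""
    return -math.log(VOCAB_SIZE)


rng = random.Random(0)
tokens = [rng.randrange(VOCAB_SIZE) for _ in range(16)]
value = mir_loss(tokens, uniform_log_prob, lam=0.5, r_min=0.1, r_max=0.5, rng=rng)
print(f"MIR loss under the toy model: {value:.4f}")
```

Because masking is applied only to the auxiliary training input, inference uses the unmodified model on clean text, which is why decoding cost is unchanged.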

### 3.3 Theoretical Intuition: Reducing Memorization via Masking

We provide intuition for how masking improves validation loss by analyzing a toy context-specific noise model in Appendix [C](https://arxiv.org/html/2606.06888#A3). This model decomposes each sequence into three parts: a context-specific component that enables memorization and acts as noise for generalization, a generalizable component that contains predictive features, and an output token to be predicted from the first two components. Under the data-constrained, compute-rich regime, we establish the following result.

Theorem (Informal). Under the context-specific noise model in the data-constrained, compute-rich regime, standard autoregressive pretraining can minimize training loss by relying almost entirely on the context-specific component, thereby memorizing patterns that do not generalize to unseen examples. In contrast, MIR regularizes the model's dependence on the context-specific components and encourages it to learn predictive patterns on the generalizable components, thereby strictly improving validation loss. Moreover, for a fixed data size, this improvement increases as model capacity grows.

This informal theorem illustrates that MIR improves validation loss by reducing the model's dependence on context-specific noise and encouraging it to learn generalizable predictive features. We provide the formal model definition, assumptions, theorem statement, and proofs in Appendix [C](https://arxiv.org/html/2606.06888#A3).

### 3.4 Empirical Results

We evaluate MIR against the strongly regularized AR model baseline along four axes: scaling behavior on natural language, whether the gain transfers to coding data, where the gain comes from, and whether it translates to downstream tasks. Throughout, the only difference between MIR and the baseline is the auxiliary masked-input loss; architecture, optimizer, and the per-cell-tuned (epochs, weight decay, learning rate) configuration are held fixed across the two recipes.

We find the following. (1) On DCLM with 100M unique tokens, MIR reduces validation loss at every model scale from 72M to 1.4B and on every matched random seed, with the average gain growing from roughly $0.006$ at 72M to about $0.03$ at 1.4B. This trend is consistent with the theoretical prediction in Section [3.3](https://arxiv.org/html/2606.06888#S3.SS3) that overparameterized models benefit more from masking-based regularization. (2) The benefit is not specific to natural language: with hyperparameters tuned only on DCLM, MIR also reduces validation loss at all five model sizes on the code-heavy Stack-V2 dataset. (3) A token-level analysis on the 1.4B model shows that MIR's gain comes from a broad set of validation positions rather than a few outliers. At the positions where MIR most outperforms the baseline, the true next token is itself usually a common one such as a function word or punctuation, and what makes prediction difficult is the preceding context: rare names, mixed scripts, broken word pieces, or noisy web text (Section [3.4](https://arxiv.org/html/2606.06888#S3.SS4)). (4) The loss improvement directionally transfers to downstream tasks: the 1.4B MIR model outperforms the strongly regularized baseline on six of eight zero-shot metrics, including $+10.2$ points on BoolQ and $+2.2$ points on SciQ.

Experimental setup. To evaluate masked-input regularization, we train models on two distinct data distributions: standard natural language from DCLM (Li et al., [2024](https://arxiv.org/html/2606.06888#bib.bib18)) and code-heavy text from Stack-V2 (Lozhkov et al., [2024](https://arxiv.org/html/2606.06888#bib.bib19)). For both datasets, the pretraining budget is fixed to 100M unique seed tokens and 10M tokens are reserved for validation. We tune hyperparameters only on DCLM data and test whether the improvement from MIR still holds on the Stack-V2 dataset.

We build a scaling ladder with five model sizes:

$$\text{ScalingLadder}(k)=(kW_1,kL_1,S_1,B_1),$$

where $W_1=1024$ is the embedding dimension when $k=1$, $L_1=12$ is the number of layers when $k=1$, $S_1=2048$ is the sequence length, $B_1=128$ is the total batch size, and $k\in\{0.5,0.75,1,1.5,2\}$. Across the scaling ladder, the attention head dimension is fixed at 64, while the depth, embedding dimension, MLP dimension, and number of attention heads increase with scale. The model size ranges from 72M to 1.4B. We use a Llama-style decoder-only transformer with the same architecture for all experiments, and the AdamW optimizer throughout. We use grid search to select the number of training steps, learning rate, and weight decay for each model. See Appendix [A](https://arxiv.org/html/2606.06888#A1) for details on the optimizer, model architecture, and hyperparameter search.
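The ladder can be enumerated directly from the definition. The sketch below is an illustration, not the released configuration: the head count is derived under the assumption that the embedding dimension equals the number of heads times the fixed head dimension of 64, and the MLP dimension is omitted because this section does not state it.

```python
W1, L1, S1, B1 = 1024, 12, 2048, 128  # base width, depth, sequence length, batch size
HEAD_DIM = 64                          # fixed attention head dimension


def scaling_ladder(k):
    """Return the ladder configuration (k*W1, k*L1, S1, B1) plus a derived head count."""
    width = int(k * W1)
    layers = int(k * L1)
    return {
        "k": k,
        "width": width,
        "layers": layers,
        # Assumption: width = heads * HEAD_DIM, as in standard multi-head attention.
        "heads": width // HEAD_DIM,
        "seq_len": S1,
        "batch_size": B1,
    }


ladder = [scaling_ladder(k) for k in (0.5, 0.75, 1, 1.5, 2)]
for cfg in ladder:
    print(cfg)
```

Width and depth scale together with $k$ while sequence length and batch size stay fixed, so the number of tokens per optimizer step is the same at every rung.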

Validation loss improvements. Figure [1(a)](https://arxiv.org/html/2606.06888#S1.F1.sf1) visualizes validation loss across the scaling ladder for the DCLM 100M dataset, averaged over five random seeds. MIR improves validation loss over the strongly regularized baseline for every matched seed at every model scale. On the 1.4B parameter model, for example, MIR reduces the mean validation loss from 3.347 to 3.317. The average gain grows from roughly 0.006 loss at 72M parameters to about 0.03 loss for the two largest models, suggesting that MIR is especially useful when model capacity is high relative to the amount of unique training data. This trend is qualitatively consistent with our theoretical analysis, which predicts that larger overparameterized models are more prone to overfitting, and therefore benefit more from masking-based regularization.

Crucially, this regularization benefit generalizes beyond standard natural language. We repeat the same 100M token experiments to evaluate performance on code-heavy data: on Stack-V2, MIR reduces validation loss at all five model sizes, with absolute gains from 0.008 to 0.020 loss; see full numbers in Table [8](https://arxiv.org/html/2606.06888#A1.T8) in the Appendix.

Where MIR helps: Token-level analysis. To localize where the validation-loss gain comes from, we compare the 1.4B regularized baseline and the 1.4B MIR model on the 10M DCLM eval dataset. For each position $t$, we compute the negative log-likelihood on the true target $y_t=x_{t+1}$ and define the token-level loss gap as $\Delta\ell_t=\ell_{\mathrm{base}}(t)-\ell_{\mathrm{MIR}}(t)$, so that positive values favor MIR. Figure [3](https://arxiv.org/html/2606.06888#S3.F3) (left) shows that the MIR-better tail is both larger and slightly heavier than the baseline-better tail after removing the center region $|\Delta\ell_t|<1$: $6.61\%$ of tokens satisfy $\Delta\ell_t\geq 1$, while $5.41\%$ satisfy $\Delta\ell_t\leq -1$. Therefore, the overall loss gain is not driven by a few isolated outliers, but appears on a broad set of hard validation tokens.
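The tail statistics above reduce to a per-token subtraction followed by two threshold counts. The sketch below shows that computation on a handful of made-up per-token losses; the numbers are placeholders for illustration only and are unrelated to the reported $6.61\%$ and $5.41\%$.

```python
def loss_gap_tails(base_nll, mir_nll, threshold=1.0):
    """Return per-token gaps (base - MIR) and the fraction of tokens in each tail."""
    gaps = [b - m for b, m in zip(base_nll, mir_nll)]
    n = len(gaps)
    mir_better = sum(g >= threshold for g in gaps) / n    # gap >= +threshold
    base_better = sum(g <= -threshold for g in gaps) / n  # gap <= -threshold
    return gaps, mir_better, base_better


# Made-up per-token negative log-likelihoods for six validation positions.
base = [2.0, 5.5, 0.3, 4.0, 1.2, 6.0]
mir = [1.9, 3.0, 0.4, 5.2, 1.1, 4.5]

gaps, mir_better, base_better = loss_gap_tails(base, mir)
print(f"MIR-better tail: {mir_better:.1%}, baseline-better tail: {base_better:.1%}")
```

Comparing the two tail fractions, rather than only the mean gap, is what separates a broad improvement from one driven by a few extreme tokens.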

The top positive-gap tokens reveal a clear qualitative pattern. We rank all validation positions by $\Delta\ell_t$, decode the top $0.1\%$ MIR-better positions, and inspect the true token together with the preceding and following tokens. These high-gap examples are dominated by continuation problems rather than standalone rare targets: $62.6\%$ are word or subword continuations, $16.3\%$ occur in non-English or transliterated text, and $11.6\%$ are punctuation tokens. Importantly, the true token is often a common token such as "and", "to", "of", "is", a comma, or a closing parenthesis. What makes these positions hard to predict is the local prefix context: non-English languages, rare names, mixed scripts, broken word pieces, or noisy web and markup text.

![Refer to caption](https://arxiv.org/html/2606.06888v1/x4.png)
![Refer to caption](https://arxiv.org/html/2606.06888v1/x5.png)

Figure 3: Left: Absolute token-level loss-gap tails on all validation tokens for the 1.4B models after removing the center region $|\Delta\ell|<1$. The positive tail, where MIR assigns higher probability to the true next token than the strongly regularized baseline, is both larger and slightly heavier. Right: Representative MIR-better tokens from the top $0.1\%$ positive-gap set. In each example, the target next token is highlighted in red, and the probabilities assigned to that token by MIR and by the strongly regularized baseline are shown below. Many large-gap cases involve names, subword completions, mixed scripts, or noisy web and technical text, even when the true token itself is common.

Representative examples illustrate how these gains arise in practice. Figure [3](https://arxiv.org/html/2606.06888#S3.F3) (right) shows cases where the baseline falls back to a generic continuation or keeps following the wrong local pattern, whereas MIR recovers the intended continuation. One example is a mixed-script entity name followed by a Japanese parenthetical gloss, where the correct next token is the closing parenthesis ")"; MIR predicts it correctly, while the baseline keeps extending the Japanese string. Taken together, these results suggest that MIR helps most when the next-token decision depends on robustness to unusual or noisy local prefix context rather than on simple frequency-based continuation. This pattern is consistent with our theoretical intuition that masking regularizes the model's dependence on irrelevant details in the prefix context and encourages it to learn predictive features that generalize across contexts.

Downstream evaluations. To understand whether the improved validation loss translates to capability gains on downstream tasks, we evaluate the two 1.4B models trained on the DCLM dataset with $U=100\mathrm{M}$ across a suite of downstream tasks using lm-evaluation-harness (Gao et al., [2024](https://arxiv.org/html/2606.06888#bib.bib2)).

Table [1](https://arxiv.org/html/2606.06888#S3.T1) shows that the MIR-trained model achieves superior performance on six of the eight evaluated metrics. The improvements are particularly pronounced on reasoning and reading comprehension tasks, pushing accuracy on BoolQ up by 10.18 percentage points and SciQ up by 2.20 percentage points compared to the strongly regularized baseline. Because these models are trained at academic scale, with 1.4B parameters and 100M unique training tokens, we view the downstream evaluations as a coarse capability check: MIR shows a large gain on BoolQ and smaller mixed changes elsewhere, with the overall pattern directionally agreeing with its validation-loss improvement.

Table 1: Downstream zero-shot evaluation for 1.4B models trained on the DCLM data with $U=$ 100M.

| Task | Random Guess | Regularized Baseline | + MIR |
| --- | --- | --- | --- |
| ARC-Easy (acc_norm) | 0.2500 | 0.3805 ± 0.0100 | 0.3893 ± 0.0100 |
| BoolQ (acc) | 0.5000 | 0.4511 ± 0.0087 | 0.5529 ± 0.0087 |
| HellaSwag (acc_norm) | 0.2500 | 0.2833 ± 0.0045 | 0.2855 ± 0.0045 |
| PiQA (acc_norm) | 0.5000 | 0.5996 ± 0.0114 | 0.5985 ± 0.0114 |
| RACE (acc) | 0.2500 | 0.2689 ± 0.0137 | 0.2766 ± 0.0138 |
| SciQ (acc_norm) | 0.2500 | 0.5780 ± 0.0156 | 0.6000 ± 0.0155 |
| Lambada (acc) | ∼0.0000 | 0.2271 ± 0.0058 | 0.2261 ± 0.0058 |
| Lambada (perplexity) | N/A | 112.8966 ± 4.9091 | 106.7115 ± 4.5752 |
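The "six of eight" count and the quoted percentage-point gains can be re-derived from Table 1. The sketch below does that arithmetic; the `delta_over_noise` column is our own rough addition, which combines the two reported standard errors as if they were independent (an assumption, since both models are scored on the same evaluation examples).

```python
import math

# (task, higher_is_better, baseline, baseline_se, mir, mir_se), copied from Table 1.
TABLE_1 = [
    ("ARC-Easy (acc_norm)", True, 0.3805, 0.0100, 0.3893, 0.0100),
    ("BoolQ (acc)", True, 0.4511, 0.0087, 0.5529, 0.0087),
    ("HellaSwag (acc_norm)", True, 0.2833, 0.0045, 0.2855, 0.0045),
    ("PiQA (acc_norm)", True, 0.5996, 0.0114, 0.5985, 0.0114),
    ("RACE (acc)", True, 0.2689, 0.0137, 0.2766, 0.0138),
    ("SciQ (acc_norm)", True, 0.5780, 0.0156, 0.6000, 0.0155),
    ("Lambada (acc)", True, 0.2271, 0.0058, 0.2261, 0.0058),
    ("Lambada (perplexity)", False, 112.8966, 4.9091, 106.7115, 4.5752),
]


def compare(rows):
    """Return the per-task change (MIR minus baseline) and whether it is an improvement."""
    results = {}
    for task, higher_is_better, base, base_se, mir, mir_se in rows:
        delta = mir - base
        improved = delta > 0 if higher_is_better else delta < 0
        # Rough noise scale only: treats the two standard errors as independent,
        # which is an assumption of this sketch rather than the paper's analysis.
        noise = math.sqrt(base_se ** 2 + mir_se ** 2)
        results[task] = {
            "delta": delta,
            "improved": improved,
            "delta_over_noise": abs(delta) / noise,
        }
    return results


results = compare(TABLE_1)
n_improved = sum(r["improved"] for r in results.values())
print(f"improved on {n_improved} of {len(results)} metrics")
for task, r in results.items():
    print(f"{task:22s} delta={r['delta']:+.4f}  |delta|/noise={r['delta_over_noise']:.2f}")
```

On these numbers, BoolQ is the only task whose change is several times the combined standard error, which matches the reading of the downstream results as a coarse capability check.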

## 4 Scaling in the Data-Constrained, Compute-Rich Regime

In this section, we extend the experiments to a five\-by\-four grid of model sizes by unique\-data budgets and use it to \(a\) show that the classical Chinchilla scaling law is misspecified in the data\-constrained, compute\-rich regime, \(b\) propose the SoftQ scaling law as a better\-fitting alternative, and \(c\) quantify MIR’s data efficiency gain over the strongly regularized baseline\.

Constructing the baseline grid. We choose three additional unique-data budgets beyond the 100M used in Section [3.4](https://arxiv.org/html/2606.06888#S3.SS4): 200M, 300M, and 400M. The grid is thus five model sizes $\times$ four data sizes: $\{72\text{M},140\text{M},257\text{M},664\text{M},1.4\text{B}\}\times\{100\text{M},200\text{M},300\text{M},400\text{M}\}$. For each cell, we tune the number of epochs, weight decay, and learning rate; the optimal weight decay is consistently much larger than the standard value of 0.1, so following Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)) we call this the strongly regularized recipe. See Appendix [A](https://arxiv.org/html/2606.06888#A1) for the hyperparameter search and best configurations. The result is a baseline dataset $\{(N,U,L)\}$ of 20 points, where $L$ is the validation loss of the AR model of size $N$ in the scaling ladder trained on $U$ unique tokens with the best hyperparameters $(N_{E},\text{weight decay},\text{learning rate})$ for that cell.

### 4.1 The SoftQ scaling law

Why Chinchilla is Misspecified\.The Chinchilla scaling law decomposes loss into irreducible entropy, finite\-parameter error, and finite\-data error:

$$L_{\mathrm{Ch}}(N,U)=E+\frac{A}{N^{\alpha}}+\frac{B}{U^{\beta}}. \tag{1}$$

Its additive structure implies that the parameter and data terms are separable. Consequently, given a model with size $N$, the loss gap between two unique data budgets $U_{1},U_{2}$ does not depend on $N$:

$$L_{\mathrm{Ch}}(N,U_{1})-L_{\mathrm{Ch}}(N,U_{2})=\frac{B}{U_{1}^{\beta}}-\frac{B}{U_{2}^{\beta}}.$$

This prediction is at odds with the expected behavior in data-constrained, compute-rich pretraining. The marginal value of additional unique data should depend on model size. For sufficiently small models, both $U_{1}$ and $U_{2}$ provide more unique information than the model can effectively exploit, so the losses obtained from the two data budgets should be similar. In this regime, the loss gap should be close to zero. For sufficiently large models, capacity is no longer the binding constraint, and the difference in available unique information between $U_{1}$ and $U_{2}$ should become visible in validation loss. The gap should therefore increase with $N$, reflecting a coupling between model size and unique data budget that Chinchilla's additive form cannot represent.

We verify this behavior empirically. Figure [1(b)](https://arxiv.org/html/2606.06888#S1.F1.sf2) shows the diagnostic directly: the loss gap between each smaller data budget and the 400M budget increases with model size, whereas Chinchilla predicts a constant gap for each budget. This motivates a coupled law rather than an additive one.

Existing coupled laws. Two prior laws have moved in this direction. Muennighoff et al. ([2023](https://arxiv.org/html/2606.06888#bib.bib23)) generalize the Chinchilla law by replacing raw data and parameter counts with an effective model size $N'$ and an effective data size $D'$ that saturate under repeated data and excess parameters. Their law takes the number of epochs $N_{E}$ as an additional input to predict the validation loss:

$$L_{\mathrm{M}}(N,U,N_{E})=E+\frac{A}{(N')^{\alpha}}+\frac{B}{(D')^{\beta}},\quad D'=f(U,N_{E}),\ N'=g(N,U,N_{E}), \tag{2}$$

which has seven parameters to fit. Michaud ([2026](https://arxiv.org/html/2606.06888#bib.bib16)) derives a scaling law from the quanta-skill learning model. Assuming that the use frequencies of skills follow a power law, he obtains $L(N,U)-E\propto n(N,U)^{-\alpha}$, where $n(N,U)$ is the number of skills the model can learn given $N$ parameters and $U$ unique tokens. Under further assumptions, $n(N,U)\propto N$ when $U\rightarrow\infty$ and $n(N,U)\propto U^{1/(1+\alpha)}$ when $N\rightarrow\infty$. Concurrently, Merrill et al. ([2026](https://arxiv.org/html/2606.06888#bib.bib1)) proposed Expressivity-Aware Scaling Laws, which derive the same scaling properties. Setting $n(N,U)=\big(A/N+B/U^{1/(1+\alpha)}\big)^{-1}$ yields the Quanta scaling law:

$$L_{\mathrm{Q}}(N,U)=E+\left(\frac{A}{N}+\frac{B}{U^{1/(1+\alpha)}}\right)^{\alpha}, \tag{3}$$

where the marginal value of increasing model size depends on the available data through the outer exponent. We give the full expression of the Muennighoff law in Appendix [B](https://arxiv.org/html/2606.06888#A2) and the detailed Quanta derivation in Appendix [D](https://arxiv.org/html/2606.06888#A4).
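The two limiting regimes of the Quanta law can be read off Eq. (3) directly. A short worked check, using only the symbols already defined:

```latex
\begin{align*}
U\to\infty:&\quad \frac{B}{U^{1/(1+\alpha)}}\to 0
  \;\Longrightarrow\;
  L_{\mathrm{Q}}(N,U)-E \;\to\; \left(\frac{A}{N}\right)^{\alpha} = A^{\alpha}\,N^{-\alpha},\\[4pt]
N\to\infty:&\quad \frac{A}{N}\to 0
  \;\Longrightarrow\;
  L_{\mathrm{Q}}(N,U)-E \;\to\; \left(\frac{B}{U^{1/(1+\alpha)}}\right)^{\alpha} = B^{\alpha}\,U^{-\alpha/(1+\alpha)}.
\end{align*}
```

Both limits agree with $L-E\propto n^{-\alpha}$ under $n\propto N$ and $n\propto U^{1/(1+\alpha)}$ respectively, with the constants of proportionality absorbed into $A$ and $B$.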

SoftQ. Motivated by the skill-learning view of scaling, we propose the *SoftQ scaling law*, a soft-quanta law that combines the parameter-limited and data-limited regimes through a smooth bottleneck:

$$L_{\mathrm{SoftQ}}(N,U)=E+\left(\frac{A}{N^{\rho}}+\frac{B}{U^{\rho/(1+\alpha)}}\right)^{\alpha/\rho}. \tag{4}$$

The parameter $\rho$ controls the sharpness of the transition between the parameter-limited and data-limited regimes. As $U\rightarrow\infty$, the law recovers a parameter-scaling limit $L-E\propto N^{-\alpha}$; as $N\rightarrow\infty$, it recovers a data-scaling limit $L-E\propto U^{-\alpha/(1+\alpha)}$. It has five fitted parameters, $\{A,B,E,\alpha,\rho\}$, matching the Chinchilla parameter count while explicitly coupling model size and data size. When $\rho=1$, SoftQ reduces to the Quanta law, so SoftQ strictly nests Quanta as a special case while adding one parameter that controls the bottleneck sharpness.
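The properties stated for Eq. (4) can be checked with a few lines of code. This is a minimal sketch: `COEF` and `RHO` are hypothetical values chosen for illustration, not the fitted coefficients. It checks that $\rho=1$ reproduces the Quanta law, that the $U\rightarrow\infty$ limit is $A^{\alpha/\rho}N^{-\alpha}$, and that the data gap grows with model size (which holds when $\alpha/\rho<1$, as in this example).

```python
import math


def quanta_loss(N, U, E, A, B, alpha):
    """Quanta law, Eq. (3)."""
    return E + (A / N + B / U ** (1.0 / (1.0 + alpha))) ** alpha


def softq_loss(N, U, E, A, B, alpha, rho):
    """SoftQ law, Eq. (4): a smooth bottleneck between parameter and data terms."""
    bottleneck = A / N ** rho + B / U ** (rho / (1.0 + alpha))
    return E + bottleneck ** (alpha / rho)


# Hypothetical coefficients for illustration only; not fitted values from the paper.
COEF = dict(E=0.3, A=100.0, B=100.0, alpha=0.15)
RHO = 0.5
MODEL_SIZES = [72e6, 140e6, 257e6, 664e6, 1.4e9]

# (1) rho = 1 recovers the Quanta law.
nests_quanta = math.isclose(
    softq_loss(257e6, 200e6, rho=1.0, **COEF),
    quanta_loss(257e6, 200e6, **COEF),
    rel_tol=1e-12,
)

# (2) With effectively unlimited data, L - E approaches A^(alpha/rho) * N^(-alpha).
N_fixed = 1e9
excess_large_U = softq_loss(N_fixed, 1e40, rho=RHO, **COEF) - COEF["E"]
param_limit = COEF["A"] ** (COEF["alpha"] / RHO) * N_fixed ** (-COEF["alpha"])

# (3) Unlike the additive form, the 100M-vs-400M gap now depends on N.
gaps = [
    softq_loss(N, 100e6, rho=RHO, **COEF) - softq_loss(N, 400e6, rho=RHO, **COEF)
    for N in MODEL_SIZES
]

print("rho=1 matches Quanta:", nests_quanta)
print(f"large-U excess loss {excess_large_U:.6f} vs limit {param_limit:.6f}")
for N, gap in zip(MODEL_SIZES, gaps):
    print(f"N={N:.3g}  gap(100M vs 400M)={gap:.5f}")
```

The widening gap in step (3) is the coupling between model size and unique-data budget that the additive Chinchilla form cannot express.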

### 4.2 Scaling Laws Comparison and MIR data efficiency

We compare Chinchilla, Quanta, Muennighoff, and SoftQ on three diagnostics: (1) full fit on the strongly regularized baseline results; (2) held-out fit, training on the 100M/200M/300M points and predicting the five 400M points; and (3) full fit on an independent baseline dataset provided by Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)). For the fitting protocol, Chinchilla, Quanta, and SoftQ use the Approach-3-style objective of Hoffmann et al. ([2022](https://arxiv.org/html/2606.06888#bib.bib29)): Huber loss with threshold $\delta=10^{-3}$ on log-loss residuals. For the Muennighoff-style law, our main comparison uses a dataset-adapted two-stage protocol: fit the base Chinchilla coefficients on the same split, then hold them fixed while fitting only the decay constants $R_{N}^{\star}$ and $R_{D}^{\star}$. We report RMSE and MAE on the raw validation-loss scale, and an SSE-based Gaussian AIC, $n\log(\mathrm{RSS}/n)+2k$, where $n$ is the number of fitted data points, $\mathrm{RSS}$ is the residual sum of squares, and $k$ is the number of fitted parameters.
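The fitting objective and the AIC can be sketched as follows. This is an illustration under stated assumptions, not the released implementation: the function names are ours, the Huber terms are summed (the text does not say whether they are summed or averaged), and the optimizer is omitted. Since $\mathrm{RSS}/n=\mathrm{RMSE}^{2}$, the AIC can also be computed from a reported RMSE, here with $n=20$ for the full baseline grid.

```python
import math


def huber(residual, delta=1e-3):
    """Huber loss: quadratic for |r| <= delta, linear beyond it."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * residual ** 2
    return delta * (magnitude - 0.5 * delta)


def fit_objective(predicted_losses, observed_losses, delta=1e-3):
    """Summed Huber loss on log-loss residuals (summation is an assumption here)."""
    return sum(
        huber(math.log(pred) - math.log(obs), delta)
        for pred, obs in zip(predicted_losses, observed_losses)
    )


def gaussian_aic(residuals, k):
    """SSE-based Gaussian AIC: n * log(RSS / n) + 2k, with the natural logarithm."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + 2 * k


def aic_from_rmse(rmse, n, k):
    """Same AIC written in terms of RMSE, using RSS / n = RMSE^2."""
    return n * math.log(rmse ** 2) + 2 * k


# A law with the same RMSE but more parameters is penalized by 2 per parameter.
print(aic_from_rmse(0.02653, n=20, k=5))
print(aic_from_rmse(0.02653, n=20, k=7))
```

The $2k$ term is what lets a four-parameter law be compared fairly against a seven-parameter one on the same data.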

Table 2: Scaling laws comparison results. Lower is better.

| Law | $k$ | Full fit RMSE | Full fit MAE | Full fit AIC | Held-out 400M RMSE | Held-out 400M MAE | Kim et al. full fit RMSE | Kim et al. full fit AIC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chinchilla | 5 | 0.02653 | 0.01802 | -135.18 | 0.03106 | 0.02540 | 0.04041 | -92.68 |
| Quanta | 4 | 0.01252 | 0.00889 | -167.23 | 0.01497 | 0.01207 | 0.02375 | -111.69 |
| Muennighoff | 7 | 0.02335 | 0.01713 | -136.29 | 0.03252 | 0.02711 | 0.03299 | -95.17 |
| SoftQ | 5 | 0.00801 | 0.00520 | -183.06 | 0.00595 | 0.00471 | 0.00785 | -145.10 |

Table [2](https://arxiv.org/html/2606.06888#S4.T2) shows that SoftQ is the strongest baseline law across all three diagnostics. It gives the best in-sample fit on the full baseline dataset, the best data-axis extrapolation to the held-out 400M budget, and the best fit on the external scaling law dataset from Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)). The held-out result is especially important for data-efficiency estimation, because we later ask how much additional unique data the regularized baseline would need to match an MIR asymptote. Figure [1(b)](https://arxiv.org/html/2606.06888#S1.F1.sf2) visualizes this fit: SoftQ reproduces the empirical fan-out across data budgets that Chinchilla cannot. Eq. ([14](https://arxiv.org/html/2606.06888#A2.E14)) in Appendix [B](https://arxiv.org/html/2606.06888#A2) gives its full expression.

We train models with MIR on the same grid. For each unique-token budget, we fit the parameter-scaling asymptote $L_{\mathrm{MIR},U}(N)=E_{U}+A_{U}/N^{\alpha_{U}}$ using the five model sizes at that budget. See Figure [8](https://arxiv.org/html/2606.06888#A2.F8) for the fitted curves. Taking $N\to\infty$ in Eq. ([14](https://arxiv.org/html/2606.06888#A2.E14)) gives the regularized-baseline infinite-model curve $L_{\mathrm{Reg},\infty}(U)=0.306+2.249\,U^{-0.125}$. For each MIR asymptote $E_{U}$, we solve $L_{\mathrm{Reg},\infty}(U_{\mathrm{eq}})=E_{U}$ and report $U_{\mathrm{eq}}/U_{\mathrm{MIR}}$. Under this baseline law, MIR consistently improves unique-data efficiency: at 200M–400M unique tokens, the regularized baseline would need about 1.28–1.34$\times$ as much unique data to match the MIR infinite-model asymptote. For completeness, we also use the other three scaling laws to calculate the data efficiency ratios. SoftQ gives the most conservative data efficiency ratio at $U=$ 400M among all scaling laws. See Appendix [B.6](https://arxiv.org/html/2606.06888#A2.SS6) for details.
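Solving $L_{\mathrm{Reg},\infty}(U_{\mathrm{eq}})=E_{U}$ has a closed form, $U_{\mathrm{eq}}=\big(2.249/(E_{U}-0.306)\big)^{1/0.125}$. The sketch below implements that inversion with the three constants quoted in the text. The function names are ours, and because this section does not state the unit of $U$ or the fitted $E_{U}$ values, the example uses a synthetic asymptote built from the curve itself rather than a number from the paper.

```python
# Constants of the regularized-baseline infinite-model curve quoted in the text.
E_REG, B_REG, BETA_REG = 0.306, 2.249, 0.125


def reg_infinite_model_loss(U):
    """L_Reg,inf(U) = 0.306 + 2.249 * U^(-0.125); U in the units used by the fit."""
    return E_REG + B_REG * U ** (-BETA_REG)


def equivalent_unique_data(target_loss):
    """Invert the curve: the U at which the baseline asymptote equals target_loss."""
    excess = target_loss - E_REG
    if excess <= 0:
        raise ValueError("target loss is at or below the irreducible term of the baseline law")
    return (B_REG / excess) ** (1.0 / BETA_REG)


def data_efficiency_ratio(mir_asymptote, U_mir):
    """U_eq / U_MIR for a given MIR infinite-model asymptote E_U."""
    return equivalent_unique_data(mir_asymptote) / U_mir


# Synthetic check (not a result from the paper): if the MIR asymptote at budget U
# equalled the baseline loss at 1.3 * U, the recovered ratio should be 1.3.
U_MIR = 0.4  # hypothetical budget; its unit is an assumption of this sketch
synthetic_asymptote = reg_infinite_model_loss(1.3 * U_MIR)
ratio = data_efficiency_ratio(synthetic_asymptote, U_MIR)
print(f"recovered data-efficiency ratio: {ratio:.4f}")
```

Because the exponent is small, the inverse raises the loss ratio to the eighth power, so modest differences in the fitted asymptote translate into sizeable differences in $U_{\mathrm{eq}}$.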

## 5 Related Work

#### Classical Scaling Laws

Empirical scaling laws have provided a central tool for predicting language-model loss as a function of model size, data, and compute. Hestness et al. ([2017](https://arxiv.org/html/2606.06888#bib.bib8)); Rosenfeld et al. ([2020](https://arxiv.org/html/2606.06888#bib.bib9)) found that deep-learning generalization curves often follow power laws across model and dataset scales. For language modeling, Kaplan et al. ([2020](https://arxiv.org/html/2606.06888#bib.bib30)) showed that cross-entropy loss scales predictably with parameter count, dataset size, and training compute. Henighan et al. ([2020](https://arxiv.org/html/2606.06888#bib.bib10)) extended similar power-law behavior to other autoregressive generative domains. Hoffmann et al. ([2022](https://arxiv.org/html/2606.06888#bib.bib29)) revised the compute-optimal allocation problem and argued that model size and training tokens should be increased at comparable rates, leading to the Chinchilla recipe. These laws are highly effective in the abundant-data setting, but they typically treat processed tokens as fresh samples and therefore do not explicitly distinguish unique data from repeated epochs. This distinction becomes important once the available corpus size, rather than compute, becomes the binding resource.

#### Data\-constrained Pretraining

Muennighoff et al. ([2023](https://arxiv.org/html/2606.06888#bib.bib23)) studied repeated-data training and proposed effective-resource scaling laws that account for the diminishing value of repeated tokens and excess parameters; they found that modest repetition can be close to fresh data, but that the marginal value of repetition eventually decays. Kim et al. ([2026b](https://arxiv.org/html/2606.06888#bib.bib31)) sharpened this into an infinite-compute, fixed-data viewpoint, showing that simply increasing epochs and parameters can overfit, and that much stronger regularization, especially substantially larger weight decay than standard practice, can improve the best attainable loss. Recent work has also explored data-side and benchmark-driven approaches to this regime. Kim et al. ([2026a](https://arxiv.org/html/2606.06888#bib.bib11)) generate document-level synthetic rephrases and show that scaling these generations improves validation loss on web text. The NanoGPT Slowrun benchmark (Q Labs, [2026](https://arxiv.org/html/2606.06888#bib.bib17)) similarly operationalizes fixed-data, high-compute language modeling by fixing 100M FineWeb (Penedo et al., [2024](https://arxiv.org/html/2606.06888#bib.bib33)) tokens and ranking methods by validation loss with no compute limit.

#### Masked Diffusion Language Model

Discrete and masked diffusion language models provide an alternative to left-to-right factorization by corrupting tokens and learning to reverse the corruption process. Sahoo et al. ([2024](https://arxiv.org/html/2606.06888#bib.bib38)) proposed masked diffusion language models with effective training recipes. Nie et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib41)); Bie et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib45)) scaled up the model and data size to train large-scale diffusion language models. In the data-constrained setting, Prabhudesai et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib32)) and Ni et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib35)) report that masked diffusion models can outperform autoregressive models under repeated-data training, attributing the gains to factors such as any-order prediction, dense denoising supervision, and implicit Monte Carlo augmentation.

#### Masking, noising, and denoising objectives

Training on corrupted inputs has a long history as a regularization and representation-learning principle. In NLP, BERT popularized masked language modeling for bidirectional representation learning (Devlin et al., [2019](https://arxiv.org/html/2606.06888#bib.bib3)), while BART and T5 extended masking and denoising ideas to sequence-to-sequence pretraining through masked-span reconstruction, arbitrary text corruption, and span corruption (Lewis et al., [2020](https://arxiv.org/html/2606.06888#bib.bib4); Raffel et al., [2020](https://arxiv.org/html/2606.06888#bib.bib5)). These objectives use masking as the main pretraining task and often change the architecture or inference interface relative to decoder-only autoregressive language modeling. MIR instead keeps the standard causal next-token objective and autoregressive decoding, using masking only as an auxiliary input perturbation during training. Several works are closer to MIR in spirit because they inject masking or dropout into autoregressive or limited-data training. Zhang et al. ([2020](https://arxiv.org/html/2606.06888#bib.bib6)) used word and token dropout as data augmentation and regularization in sequence modeling. Zhuang et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib7)) proposed Mask-Enhanced Autoregressive Prediction, which masks a small fraction of input tokens and then performs standard next-token prediction to improve retrieval and long-context behavior. Wang et al. ([2025](https://arxiv.org/html/2606.06888#bib.bib12)) masked low-entropy tokens to regularize multi-epoch training on limited domain data. We differ in three respects: MIR pairs the masked-input loss with the clean next-token loss rather than replacing clean training, we study random masking as a general pretraining regularizer in the data-constrained, compute-rich regime, and we quantify the resulting unique-data efficiency through fitted scaling laws.

## 6 Discussion

We study data\-constrained, compute\-rich pretraining along two axes, regularization and scaling\. First, large weight decay substantially reduces dLLM validation loss; MIR, an auxiliary next\-token loss on randomly masked inputs, further improves AR model validation loss on top of large weight decay across 72M to 1\.4B parameters\. Second, the additive Chinchilla law is misspecified in this regime because it decouples model and data size; we propose the SoftQ scaling law, which couples them and fits both our experiments and an independent grid from prior work better than existing alternatives\. Our study has several limitations\. Experiments span up to 1\.4B parameters and 400M unique tokens, small relative to frontier\-scale pretraining\. We held model architecture and optimizer fixed; varying these could yield further gains\. Our protocol also relies on heavy per\-cell hyperparameter search; a hyperparameter\-transfer recipe for this regime is a natural next step\.

## 7 Acknowledgments

We thank Eric Czech, Hrayr Harutyunyan, and Samip Dahal for helpful discussions and their invaluable feedback\. This work was supported in part by funding from the DARPA AIQ program, the Office of Naval Research under grant N00014\-23\-1\-2590, the National Science Foundation under grant No\. 2310831, No\. 2428059, No\. 2435696, No\. 2440954, a Michigan Institute for Data Science Propelling Original Data Science \(PODS\) grant, Two Sigma Investments LP, and LG Management Development Institute AI Research\.

## References

- T\. Bie, M\. Cao, K\. Chen, L\. Du, M\. Gong, Z\. Gong, Y\. Gu, J\. Hu, Z\. Huang, Z\. Lan,et al\.\(2025\)Llada2\.0: scaling up diffusion language models to 100b\.arXiv preprint arXiv:2512\.15745\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px3.p1.1)\.
- Common Crawl \(2025\)Statistics of Common Crawl Monthly Archives: Crawl Size\.Note:Accessed: 2026\-04\-28External Links:[Link](https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize)Cited by:[§1](https://arxiv.org/html/2606.06888#S1.p2.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px4.p1.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§3\.4](https://arxiv.org/html/2606.06888#S3.SS4.p10.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- T\. Henighan, J\. Kaplan, M\. Katz, M\. Chen, C\. Hesse, J\. Jackson, H\. Jun, T\. B\. Brown, P\. Dhariwal, S\. Gray,et al\.\(2020\)Scaling laws for autoregressive generative modeling\.arXiv preprint arXiv:2010\.14701\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Hestness, S\. Narang, N\. Ardalani, G\. Diamos, H\. Jun, H\. Kianinejad, M\. M\. A\. Patwary, Y\. Yang, and Y\. Zhou \(2017\)Deep learning scaling is predictable, empirically\.arXiv preprint arXiv:1712\.00409\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px1.p1.1)\.
- J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. (2022) Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. Cited by: [§B.1](https://arxiv.org/html/2606.06888#A2.SS1.p2.1), [§1](https://arxiv.org/html/2606.06888#S1.p1.1), [§1](https://arxiv.org/html/2606.06888#S1.p8.1), [§2.1](https://arxiv.org/html/2606.06888#S2.SS1.p2.3), [§4.2](https://arxiv.org/html/2606.06888#S4.SS2.p1.5), [§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px1.p1.1).
- S\. Hu, Y\. Tu, X\. Han, G\. Cui, C\. He, W\. Zhao, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, X\. Zhang, Z\. L\. Thai, C\. Wang, Y\. Yao, C\. Zhao, J\. Zhou, J\. Cai, Z\. Zhai, N\. Ding, C\. Jia, G\. Zeng, dahai li, Z\. Liu, and M\. Sun \(2024\)MiniCPM: unveiling the potential of small language models with scalable training strategies\.External Links:[Link](https://openreview.net/forum?id=3X2L2TFr0f)Cited by:[§A\.3](https://arxiv.org/html/2606.06888#A1.SS3.p1.6)\.
- A\. Huang, A\. Li, A\. Kong, B\. Wang, B\. Jiao, B\. Dong, B\. Wang, B\. Chen, B\. Li, B\. Ma,et al\.\(2026\)Step 3\.5 flash: open frontier\-level intelligence with 11b active parameters\.arXiv preprint arXiv:2602\.10604\.Cited by:[§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§1](https://arxiv.org/html/2606.06888#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.06888#S2.SS1.p2.3),[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px1.p1.1)\.
- K\. Kim, S\. Kotha, Y\. Choi, T\. Hashimoto, N\. Haber, and P\. Liang \(2026a\)Data\-efficient pre\-training by scaling synthetic megadocs\.arXiv preprint arXiv:2603\.18534\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px2.p1.1)\.
- K\. Kim, S\. Kotha, P\. Liang, and T\. Hashimoto \(2026b\)Pre\-training under infinite compute\.External Links:[Link](https://openreview.net/forum?id=ck0aZTAnwK)Cited by:[§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1),[§A\.2](https://arxiv.org/html/2606.06888#A1.SS2.p1.1),[§A\.3](https://arxiv.org/html/2606.06888#A1.SS3.p1.6),[§B\.1](https://arxiv.org/html/2606.06888#A2.SS1.p1.8),[§B\.6](https://arxiv.org/html/2606.06888#A2.SS6.p2.2),[Table 13](https://arxiv.org/html/2606.06888#A2.T13),[Table 13](https://arxiv.org/html/2606.06888#A2.T13.3.2),[Table 16](https://arxiv.org/html/2606.06888#A2.T16),[Table 16](https://arxiv.org/html/2606.06888#A2.T16.11.2),[§1](https://arxiv.org/html/2606.06888#S1.p2.1),[§1](https://arxiv.org/html/2606.06888#S1.p9.1),[§2\.1](https://arxiv.org/html/2606.06888#S2.SS1.p3.2),[§3\.1](https://arxiv.org/html/2606.06888#S3.SS1.p1.3),[§4\.2](https://arxiv.org/html/2606.06888#S4.SS2.p1.5),[§4\.2](https://arxiv.org/html/2606.06888#S4.SS2.p2.1),[§4](https://arxiv.org/html/2606.06888#S4.p2.7),[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px2.p1.1)\.
- M\. Lewis, Y\. Liu, N\. Goyal, M\. Ghazvininejad, A\. Mohamed, O\. Levy, V\. Stoyanov, and L\. Zettlemoyer \(2020\)BART: denoising sequence\-to\-sequence pre\-training for natural language generation, translation, and comprehension\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 7871–7880\.External Links:[Link](https://aclanthology.org/2020.acl-main.703/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.703)Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px4.p1.1)\.
- J\. Li, A\. Fang, G\. Smyrnis, M\. Ivgi, M\. Jordan, S\. Gadre, H\. Bansal, E\. Guha, S\. Keh, K\. Arora, S\. Garg, R\. Xin, N\. Muennighoff, R\. Heckel, J\. Mercat, M\. Chen, S\. Gururangan, M\. Wortsman, A\. Albalak, Y\. Bitton, M\. Nezhurina, A\. Abbas, C\. Hsieh, D\. Ghosh, J\. Gardner, M\. Kilian, H\. Zhang, R\. Shao, S\. Pratt, S\. Sanyal, G\. Ilharco, G\. Daras, K\. Marathe, A\. Gokaslan, J\. Zhang, K\. Chandu, T\. Nguyen, I\. Vasiljevic, S\. Kakade, S\. Song, S\. Sanghavi, F\. Faghri, S\. Oh, L\. Zettlemoyer, K\. Lo, A\. El\-Nouby, H\. Pouransari, A\. Toshev, S\. Wang, D\. Groeneveld, L\. Soldaini, P\. W\. Koh, J\. Jitsev, T\. Kollar, A\. G\. Dimakis, Y\. Carmon, A\. Dave, L\. Schmidt, and V\. Shankar \(2024\)DataComp\-lm: in search of the next generation of training sets for language models\.pp\. 14200–14282\.External Links:[Document](https://dx.doi.org/10.52202/079017-0455),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/19e4ea30dded58259665db375885e412-Paper-Datasets_and_Benchmarks_Track.pdf)Cited by:[§A\.2](https://arxiv.org/html/2606.06888#A1.SS2.p1.1),[Figure 1](https://arxiv.org/html/2606.06888#S1.F1),[Figure 1](https://arxiv.org/html/2606.06888#S1.F1.8.4.4),[§1](https://arxiv.org/html/2606.06888#S1.p7.1),[§3\.4](https://arxiv.org/html/2606.06888#S3.SS4.p3.1)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.External Links:[Link](https://openreview.net/forum?id=Bkg6RiCqY7)Cited by:[§A\.3](https://arxiv.org/html/2606.06888#A1.SS3.p1.6)\.
- A\. Lozhkov, R\. Li, L\. B\. Allal, F\. Cassano, J\. Lamy\-Poirier, N\. Tazi, A\. Tang, D\. Pykhtar, J\. Liu, Y\. Wei, T\. Liu, M\. Tian, D\. Kocetkov, A\. Zucker, Y\. Belkada, Z\. Wang, Q\. Liu, D\. Abulkhanov, I\. Paul, Z\. Li, W\. Li, M\. Risdal, J\. Li, J\. Zhu, T\. Y\. Zhuo, E\. Zheltonozhskii, N\. O\. O\. Dade, W\. Yu, L\. Krauß, N\. Jain, Y\. Su, X\. He, M\. Dey, E\. Abati, Y\. Chai, N\. Muennighoff, X\. Tang, M\. Oblokulov, C\. Akiki, M\. Marone, C\. Mou, M\. Mishra, A\. Gu, B\. Hui, T\. Dao, A\. Zebaze, O\. Dehaene, N\. Patry, C\. Xu, J\. McAuley, H\. Hu, T\. Scholak, S\. Paquet, J\. Robinson, C\. J\. Anderson, N\. Chapados, M\. Patwary, N\. Tajbakhsh, Y\. Jernite, C\. M\. Ferrandis, L\. Zhang, S\. Hughes, T\. Wolf, A\. Guha, L\. von Werra, and H\. de Vries \(2024\)StarCoder 2 and the stack v2: the next generation\.External Links:2402\.19173Cited by:[§A\.2](https://arxiv.org/html/2606.06888#A1.SS2.p2.1),[§1](https://arxiv.org/html/2606.06888#S1.p7.1),[§3\.4](https://arxiv.org/html/2606.06888#S3.SS4.p3.1)\.
- W\. Merrill, Y\. Li, T\. Romero, A\. Svete, C\. Costello, P\. Dasigi, D\. Groeneveld, D\. Heineman, B\. Kuehl, N\. Lambert,et al\.\(2026\)Olmo hybrid: from theory to practice and back\.arXiv preprint arXiv:2604\.03444\.Cited by:[§4\.1](https://arxiv.org/html/2606.06888#S4.SS1.p3.12)\.
- E\. J\. Michaud \(2026\)On neural scaling and the quanta hypothesis\.Learning Mechanics\.External Links:[Link](https://learningmechanics.pub/quanta)Cited by:[Appendix D](https://arxiv.org/html/2606.06888#A4.p1.1),[§1](https://arxiv.org/html/2606.06888#S1.p9.1),[§4\.1](https://arxiv.org/html/2606.06888#S4.SS1.p3.12)\.
- N\. Muennighoff, A\. Rush, B\. Barak, T\. Le Scao, N\. Tazi, A\. Piktus, S\. Pyysalo, T\. Wolf, and C\. A\. Raffel \(2023\)Scaling data\-constrained language models\.pp\. 50358–50376\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/9d89448b63ce1e2e8dc7af72c984c196-Paper-Conference.pdf)Cited by:[§B\.2](https://arxiv.org/html/2606.06888#A2.SS2.p3.5),[Table 19](https://arxiv.org/html/2606.06888#A2.T19),[Table 19](https://arxiv.org/html/2606.06888#A2.T19.2.1),[§1](https://arxiv.org/html/2606.06888#S1.p2.1),[§1](https://arxiv.org/html/2606.06888#S1.p9.1),[§2\.1](https://arxiv.org/html/2606.06888#S2.SS1.p3.2),[§4\.1](https://arxiv.org/html/2606.06888#S4.SS1.p3.3),[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Ni, Q\. Liu, L\. Dou, C\. Du, Z\. Wang, H\. Yan, T\. Pang, and M\. Q\. Shieh \(2025\)Diffusion language models are super data learners\.arXiv preprint arXiv:2511\.03276\.Cited by:[§1](https://arxiv.org/html/2606.06888#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.06888#S3.SS1.p1.3),[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px3.p1.1)\.
- S\. Nie, F\. Zhu, Z\. You, X\. Zhang, J\. Ou, J\. Hu, J\. ZHOU, Y\. Lin, J\. Wen, and C\. Li \(2025\)Large language diffusion models\.External Links:[Link](https://openreview.net/forum?id=KnqiC0znVF)Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px3.p1.1)\.
- T\. Olmo, A\. Ettinger, A\. Bertsch, B\. Kuehl, D\. Graham, D\. Heineman, D\. Groeneveld, F\. Brahman, F\. Timbers, H\. Ivison,et al\.\(2025\)Olmo 3\.arXiv preprint arXiv:2512\.13961\.Cited by:[§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan,et al\.\(2024\)2 olmo 2 furious\.arXiv preprint arXiv:2501\.00656\.Cited by:[§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- G\. Penedo, H\. Kydlíček, L\. B\. allal, A\. Lozhkov, M\. Mitchell, C\. Raffel, L\. V\. Werra, and T\. Wolf \(2024\)The fineweb datasets: decanting the web for the finest text data at scale\.External Links:[Link](https://openreview.net/forum?id=n6SCkn2QaG)Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px2.p1.1)\.
- M\. Prabhudesai, M\. Wu, A\. Zadeh, K\. Fragkiadaki, and D\. Pathak \(2025\)Diffusion beats autoregressive in data\-constrained settings\.External Links:[Link](https://openreview.net/forum?id=W5Ht05jF4c)Cited by:[§A\.4](https://arxiv.org/html/2606.06888#A1.SS4.p1.6),[§1](https://arxiv.org/html/2606.06888#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.06888#S3.SS1.p1.3),[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px3.p1.1)\.
- Q Labs \(2026\)NanoGPT Slowrun\.Note:[https://github\.com/qlabs\-eng/slowrun](https://github.com/qlabs-eng/slowrun)GitHub repository\. Accessed: 2026\-04\-28Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px4.p1.1)\.
- J\. S\. Rosenfeld, A\. Rosenfeld, Y\. Belinkov, and N\. Shavit \(2020\)A constructive prediction of the generalization error across scales\.External Links:[Link](https://openreview.net/forum?id=ryenvpEKDr)Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px1.p1.1)\.
- S\. S\. Sahoo, M\. Arriola, Y\. Schiff, A\. Gokaslan, E\. Marroquin, J\. T\. Chiu, A\. Rush, and V\. Kuleshov \(2024\)Simple and effective masked diffusion language models\.Advances in Neural Information Processing Systems37,pp\. 130136–130184\.Cited by:[§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px3.p1.1)\.
- J\. Sevilla and E\. Roldán \(2024\)Training compute of frontier AI models grows by 4\-5x per year\.Note:Accessed: 2026\-04\-29External Links:[Link](https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year)Cited by:[§1](https://arxiv.org/html/2606.06888#S1.p2.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. 
Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, D\. Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot \(2025\) Gemma 3 technical report\. External Links: 2503\.19786, [Link](https://arxiv.org/abs/2503.19786)\. Cited by: [§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- R\. Vershynin \(2018\) High\-dimensional probability: an introduction with applications in data science\. Vol\. 47, Cambridge University Press\. Cited by: [§C\.3](https://arxiv.org/html/2606.06888#A3.SS3.1.p1.6)\.
- P\. Villalobos, A\. Ho, J\. Sevilla, T\. Besiroglu, L\. Heim, and M\. Hobbhahn \(2024\) Will we run out of data? Limits of LLM scaling based on human\-generated data\. arXiv preprint arXiv:2211\.04325\. Cited by: [§1](https://arxiv.org/html/2606.06888#S1.p2.1)\.
- J\. Wang, Y\. Hu, Y\. Gao, H\. Wang, S\. Wang, H\. Lu, J\. Mao, W\. X\. Zhao, J\. Li, and X\. Zhang \(2025\) Entropy\-guided token dropout: training autoregressive language models with limited domain data\. arXiv preprint arXiv:2512\.23422\. Cited by: [§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px4.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\. Cited by: [§A\.1](https://arxiv.org/html/2606.06888#A1.SS1.p2.1)\.
- H\. Zhang, S\. Qiu, X\. Duan, and M\. Zhang \(2020\) Token drop mechanism for neural machine translation\. In Proceedings of the 28th International Conference on Computational Linguistics, D\. Scott, N\. Bel, and C\. Zong \(Eds\.\), Barcelona, Spain \(Online\), pp\. 4298–4303\. External Links: [Link](https://aclanthology.org/2020.coling-main.379/), [Document](https://dx.doi.org/10.18653/v1/2020.coling-main.379)\. Cited by: [§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px4.p1.1)\.
- X\. Zhuang, Z\. Jia, J\. Li, Z\. Zhang, L\. Shen, Z\. Cao, and S\. Liu \(2025\) Mask\-enhanced autoregressive prediction: pay less attention to learn more\. In Proceedings of the 42nd International Conference on Machine Learning, A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 267, pp\. 80516–80532\. External Links: [Link](https://proceedings.mlr.press/v267/zhuang25b.html)\. Cited by: [§5](https://arxiv.org/html/2606.06888#S5.SS0.SSS0.Px4.p1.1)\.

## Appendix

## Appendix A Experiment Details

This appendix describes the compute setup, architecture ladder, data splits, training recipes, hyperparameter searches, and auxiliary experimental results\. The full data\-generation and training code is available on [GitHub](https://github.com/yixinw-lab/dc_pretrain), and the training logs are available on [WandB](https://wandb.ai/zhiwei-xu2000/overtrain-dclm?nw=nwuserzhiweixu2000)\.

### A\.1 Compute, Architecture, and Scaling Ladder

All experiments can run on eight 80GB SXM H100 GPUs\. The longest AR model run completes in under 24 hours\. The longest dLLM model run completes in under 48 hours\.

We use a Llama\-style decoder\-only transformer \[Grattafiori et al\., [2024](https://arxiv.org/html/2606.06888#bib.bib34)\] with QK norm and interleaved global\-local self\-attention as the model architecture\. These two components are our additions relative to the architecture used in Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\]\. QK norm is widely used \[OLMo et al\., [2024](https://arxiv.org/html/2606.06888#bib.bib25), Team et al\., [2025](https://arxiv.org/html/2606.06888#bib.bib24), Olmo et al\., [2025](https://arxiv.org/html/2606.06888#bib.bib26), Yang et al\., [2025](https://arxiv.org/html/2606.06888#bib.bib27)\] in recent open\-source large language models to stabilize pretraining, and interleaved local and global attention is also widely used \[Team et al\., [2025](https://arxiv.org/html/2606.06888#bib.bib24), Olmo et al\., [2025](https://arxiv.org/html/2606.06888#bib.bib26), Huang et al\., [2026](https://arxiv.org/html/2606.06888#bib.bib28)\] to reduce compute and KV cache size\. We use the GPT\-2 \[Radford et al\., [2019](https://arxiv.org/html/2606.06888#bib.bib22)\] tokenizer with one extra \[MASK\] token for random masking\. The vocabulary size is $50258$\.

We follow the scaling ladder

$$\text{ScalingLadder}(k)=(kW_{1},\,kL_{1},\,S_{1},\,B_{1}),$$

where $W_{1}=1024$ is the embedding dimension when $k=1$, $L_{1}=12$ is the number of layers when $k=1$, $S_{1}=2048$ is the sequence length, $B_{1}=128$ is the total batch size, and $k\in\{0.5,0.75,1,1.5,2\}$\. Across the scaling ladder, the attention head dimension is fixed at 64, while the depth, embedding dimension, MLP dimension, and number of attention heads increase with scale\. The resulting models span from 71,965,952 parameters to 1,439,273,984 parameters\. Table [3](https://arxiv.org/html/2606.06888#A1.T3) summarizes the full architecture ladder\.

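Table 3: Scaling Ladder Details\.

| $k$ | Layers | Embed dim | MLP dim | Heads | Head dim | Model size |
|---|---|---|---|---|---|---|
| 0\.5 | 6 | 512 | 1536 | 8 | 64 | 71,965,952 |
| 0\.75 | 9 | 768 | 2048 | 12 | 64 | 140,983,680 |
| 1\.0 | 12 | 1024 | 2816 | 16 | 64 | 257,190,400 |
| 1\.5 | 18 | 1536 | 4096 | 24 | 64 | 664,200,960 |
| 2\.0 | 24 | 2048 | 5632 | 32 | 64 | 1,439,273,984 |
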
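
As a rough illustration of how the ladder maps a multiplier $k$ to an architecture, the sketch below regenerates the layer, embedding, and head columns of Table 3 from the base values in the text\. The function name `scaling_ladder` is hypothetical, and the MLP\-width rule \(8/3 of the embedding dimension, rounded up to a multiple of 256\) is an assumption inferred from the table values; the text does not state it\.

```python
# Base (k = 1) values stated in the text.
W1, L1, S1, B1 = 1024, 12, 2048, 128
HEAD_DIM = 64  # fixed across the ladder, per the text
LADDER = (0.5, 0.75, 1, 1.5, 2)


def scaling_ladder(k):
    """Return the architecture implied by ladder multiplier k (hypothetical helper)."""
    embed_dim = int(k * W1)
    layers = int(k * L1)
    heads = embed_dim // HEAD_DIM
    # ASSUMPTION (inferred from Table 3, not stated in the text):
    # MLP width = 8/3 * embed_dim, rounded up to a multiple of 256.
    mlp_dim = (8 * embed_dim + 767) // 768 * 256
    return {
        "layers": layers,
        "embed_dim": embed_dim,
        "mlp_dim": mlp_dim,
        "heads": heads,
        "head_dim": HEAD_DIM,
        "seq_len": S1,
        "batch_size": B1,
    }


for k in LADDER:
    print(k, scaling_ladder(k))
```

Sequence length and batch size are held fixed, so only depth and width change along the ladder\.
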
### A\.2 Data and Evaluation Splits

Following Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\], we use DCLM\-POOL \[Li et al\., [2024](https://arxiv.org/html/2606.06888#bib.bib18)\], an open\-source pretraining dataset containing $240$T tokens\. We use the DCLM subset generated by Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\] to construct datasets with 100M, 200M, 300M, and 400M unique training tokens\. Each smaller\-budget dataset is a subset of the corresponding larger\-budget dataset\. We always use the same evaluation dataset, which contains 10M tokens from DCLM\.

We also use Stack\-V2 \[Lozhkov et al\., [2024](https://arxiv.org/html/2606.06888#bib.bib19)\] to evaluate whether masked\-input regularization is beneficial for pretraining on code data\. The corresponding validation losses are reported in Table [8](https://arxiv.org/html/2606.06888#A1.T8)\.

### A\.3 AR Training Recipe

Unless stated otherwise, AR experiments use the AdamW optimizer \[Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.06888#bib.bib21)\] with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and $\epsilon=10^{-8}$\. This configuration is adopted from Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\]\. For all AR model pretraining, we use the warmup\-stable\-decay \(WSD\) \[Hu et al\., [2024](https://arxiv.org/html/2606.06888#bib.bib20)\] learning\-rate schedule with $1\%$ of the total steps for linear warmup and $10\%$ of the total steps for warmdown\. We set the dropout rate to $0.1$\. In comparison, Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\] use cosine annealing\. We tried both schedules for standard AR model pretraining and found that WSD performs better across all model sizes we tested\.
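
For readers unfamiliar with WSD, the following minimal sketch shows one way to realize a schedule with 1% linear warmup and 10% warmdown\. The function `wsd_lr` is hypothetical\. The linear shape of the warmdown, the final learning rate of zero, and the step indexing are assumptions; the text specifies only the warmup and warmdown fractions\.

```python
def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.01, decay_frac=0.10, final_lr=0.0):
    """Warmup-stable-decay learning rate at a 0-indexed step (illustrative sketch)."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    decay_steps = max(1, round(decay_frac * total_steps))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # ASSUMPTION: linear warmdown from peak_lr toward final_lr.
    progress = (step - decay_start) / decay_steps
    return peak_lr + (final_lr - peak_lr) * progress


# Example: 1000 steps give 10 warmup steps, 890 stable steps, 100 warmdown steps.
schedule = [wsd_lr(s, 1000, 1e-3) for s in range(1000)]
```

Unlike cosine annealing, the stable phase keeps the learning rate at its peak for most of training, and the decay is confined to the final segment\.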

### A\.4 dLLM Baseline Protocol

For dLLM pretraining, we adopt the configuration used in Prabhudesai et al\. \[[2025](https://arxiv.org/html/2606.06888#bib.bib32)\]: batch size 256, sequence length 2048, a learning\-rate schedule with peak $2\times 10^{-4}$, minimum $2\times 10^{-5}$, 1% warmup, and cosine decay, weight decay 0\.1, and gradient clipping of 1\.0\. For the number of epochs, we adopt the optimal values reported in Prabhudesai et al\. \[[2025](https://arxiv.org/html/2606.06888#bib.bib32)\]: 500 epochs for the 257M and 664M models and 800 epochs for the 140M model\. We calculate validation loss after each epoch and report the lowest value\. The 140M model achieves its lowest validation loss of $3.646694$ at epoch 789, the 257M model achieves $3.602763$ at epoch 483, and the 664M model achieves $3.680272$ at epoch 141\. We set the dropout rate to $0.1$\.

Table [4](https://arxiv.org/html/2606.06888#A1.T4) reports the DCLM 100M validation losses for the AR and dLLM recipes at the three model sizes where we run dLLM pretraining\. The strongly regularized dLLM uses the AR\-tuned weight decay while keeping the other dLLM hyperparameters fixed to the protocol above\.

Table 4: DCLM 100M validation loss for AR and dLLM recipes at different model sizes\. For AR recipes, we report final validation loss; for dLLM recipes, we report the best across\-epoch validation loss\.

| Recipe | 140M | 257M | 664M |
|---|---|---|---|
| Multi\-Epoch dLLM | 3\.646694 | 3\.602763 | 3\.680272 |
| Multi\-Epoch AR | 3\.945268 | 3\.879782 | 3\.821800 |
| Strongly Regularized dLLM | 3\.579445 | 3\.483598 | 3\.387994 |
| Strongly Regularized AR | 3\.471395 | 3\.422107 | 3\.367138 |
| MIR | 3\.468458 | 3\.404833 | 3\.332668 |

### A\.5 Multi\-epoch AR Epoch Search

We search for the best number of epochs for multi\-epoch AR at each model size\. As shown in Figure [4](https://arxiv.org/html/2606.06888#A1.F4), the best setting is 16 epochs for 140M, 8 epochs for 257M, and 32 epochs for 664M\.

![Refer to caption](https://arxiv.org/html/2606.06888v1/x6.png)

![Refer to caption](https://arxiv.org/html/2606.06888v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.06888v1/x8.png)

Figure 4: Validation loss vs\. number of epochs\. Weight decay is fixed to 0\.1 and peak learning rate is fixed to 2e\-4\. Left: model size 140M; Middle: model size 257M; Right: model size 664M\.

### A\.6 Strongly Regularized Baseline Search

The strongly regularized baseline sweeps are conducted in the data\-constrained DCLM setting described in the main text\. We run separate searches at unique\-data budgets of 100M, 200M, 300M, and 400M tokens\. Within each budget, we use the same training and evaluation datasets across all model scales so that differences in performance can be attributed to model size and training objective rather than differences in data exposure\.

We tune the optimization settings separately for each model scale and data budget\. The search space consists of the number of training epochs, weight decay, and learning rate\. In general, larger models prefer fewer epochs and stronger weight decay, while the selected learning rates remain in the range of $10^{-3}$ to $10^{-2}$\. We describe the full 100M sweeps first, then append the larger\-budget searches used in the scaling\-law analysis\.

72M model\. We search over epochs $\{16,32,64\}$, weight decay $\{0.4,0.8,1.6\}$, and learning rate $\{10^{-3},3\times 10^{-3},10^{-2}\}$\. We additionally run a refined sweep over epochs $\{16,32,64\}$, weight decay $\{0.1,0.2,0.4\}$, and learning rate $\{3\times 10^{-3},10^{-2},3\times 10^{-2}\}$\. The best configuration is $(32,0.4,10^{-2})$\.

140M model\. We first search over epochs $\{8,16,32\}$, weight decay $\{0.8,1.6,3.2\}$, and learning rate $\{3\times 10^{-4},10^{-3},3\times 10^{-3}\}$\. We then run an additional sweep over epochs $\{16,32,64\}$, weight decay $\{0.2,0.4,0.8,1.6\}$, and learning rate $\{10^{-3},3\times 10^{-3},10^{-2},3\times 10^{-2}\}$\. The best configuration is $(32,0.8,3\times 10^{-3})$\.

257M model\. We search over epochs $\{8,16,32\}$, weight decay $\{0.8,1.6,3.2\}$, and learning rate $\{3\times 10^{-4},10^{-3},3\times 10^{-3}\}$\. The best configuration is $(16,1.6,10^{-3})$\.

664M model\. We search over epochs $\{8,16,32\}$, weight decay $\{0.8,1.6,3.2\}$, and learning rate $\{3\times 10^{-4},10^{-3},3\times 10^{-3}\}$\. The best configuration is $(16,1.6,10^{-3})$\.

1\.4B model\. We search over epochs $\{4,8,16\}$, weight decay $\{1.6,3.2,6.4\}$, and learning rate $\{3\times 10^{-4},10^{-3},3\times 10^{-3}\}$\. The best configuration is $(16,3.2,10^{-3})$\.

Table [5](https://arxiv.org/html/2606.06888#A1.T5) summarizes the selected hyperparameters at each scale\.

Table 5: Best strongly regularized hyperparameter configuration in the 100M unique\-token setting\.

| Model size | Best $(\text{epochs},\text{weight decay},\text{lr})$ |
|---|---|
| 72M | $(32,0.4,10^{-2})$ |
| 140M | $(32,0.8,3\times 10^{-3})$ |
| 257M | $(16,1.6,10^{-3})$ |
| 664M | $(16,1.6,10^{-3})$ |
| 1\.4B | $(16,3.2,10^{-3})$ |

Table [6](https://arxiv.org/html/2606.06888#A1.T6) summarizes the selected hyperparameters for the larger unique\-data budgets used in the scaling\-law analysis\. For 200M and 400M unique tokens, we run budget\-specific sweeps\. For the intermediate 300M budget, we evaluate candidate configurations inherited from the selected 200M and 400M settings at each model scale\. The longest runs take around 2 hours at 200M, 6 hours at 300M, and 8 hours at 400M on 8 H100s\.

Table 6: Selected strongly regularized hyperparameter configurations for the larger unique\-data budgets\. Each entry is $(\text{epochs},\text{weight decay},\text{lr})$\.

| Unique data | Model size | Best $(\text{epochs},\text{weight decay},\text{lr})$ |
|---|---|---|
| 200M | 72M | $(64,0.2,10^{-2})$ |
| 200M | 140M | $(32,0.4,3\times 10^{-3})$ |
| 200M | 257M | $(16,0.8,10^{-3})$ |
| 200M | 664M | $(16,1.6,10^{-3})$ |
| 200M | 1\.4B | $(16,1.6,10^{-3})$ |
| 300M | 72M | $(64,0.1,10^{-2})$ |
| 300M | 140M | $(64,0.2,3\times 10^{-3})$ |
| 300M | 257M | $(32,0.8,10^{-3})$ |
| 300M | 664M | $(32,0.8,10^{-3})$ |
| 300M | 1\.4B | $(32,1.6,10^{-3})$ |
| 400M | 72M | $(64,0.1,10^{-2})$ |
| 400M | 140M | $(64,0.2,3\times 10^{-3})$ |
| 400M | 257M | $(32,0.4,10^{-3})$ |
| 400M | 664M | $(32,0.8,10^{-3})$ |
| 400M | 1\.4B | $(32,0.8,10^{-3})$ |

### A\.7 MIR Hyperparameter Tuning

In MIR, for each sequence $x$, a mask ratio $r$ is sampled from $\text{Unif}(r_{\min},r_{\max})$\. Then, for each position $t\in[0,T-1]$, we use a Bernoulli random variable with success probability $r$ to decide whether to mask $x_{t}$\. Denote the masked version as $\widetilde{x}$\. We optimize

$$\mathcal{L}=\mathcal{L}_{\mathrm{NTP}}(x)+\lambda\mathcal{L}_{\mathrm{NTP}}(\widetilde{x}).$$

We tune the values of $r_{\min}$, $r_{\max}$, and $\lambda$ using the $1.4$B model and DCLM 100M; see the results in Figure [5](https://arxiv.org/html/2606.06888#A1.F5)\. The selected values are $r_{\min}=0$, $r_{\max}=0.5$, and $\lambda=0.4$\. We also tried

$$\mathcal{L}=(1-\lambda)\mathcal{L}_{\mathrm{NTP}}(x)+\lambda\mathcal{L}_{\mathrm{NTP}}(\widetilde{x}),$$

but its performance was slightly worse than $\mathcal{L}_{\mathrm{NTP}}(x)+\lambda\mathcal{L}_{\mathrm{NTP}}(\widetilde{x})$\.
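
The masking procedure and the combined objective can be sketched as follows\. This is a framework\-free illustration, not the released implementation: `mask_sequence` and `mir_loss` are hypothetical names, the \[MASK\] id of 50257 assumes the extra token is appended after the 50257 GPT\-2 tokens, and the two next\-token\-prediction losses are passed in as plain numbers that a real model would compute\.

```python
import random

# ASSUMPTION: [MASK] is appended after the 50257 GPT-2 tokens (vocabulary size 50258).
MASK_ID = 50257


def mask_sequence(tokens, r_min=0.0, r_max=0.5, mask_id=MASK_ID, rng=random):
    """Sample one mask ratio per sequence, then mask each position independently."""
    r = rng.uniform(r_min, r_max)  # r ~ Unif(r_min, r_max)
    # Bernoulli(r) decision per position t.
    masked = [mask_id if rng.random() < r else tok for tok in tokens]
    return masked, r


def mir_loss(ntp_clean, ntp_masked, lam=0.4):
    """Combine the clean and masked-input NTP losses: L = L_NTP(x) + lam * L_NTP(x~)."""
    return ntp_clean + lam * ntp_masked


# Example with the selected values r_min = 0, r_max = 0.5, lambda = 0.4.
rng = random.Random(0)
x = list(range(16))
x_tilde, r = mask_sequence(x, rng=rng)
# In training, both x and x_tilde would be fed to the same model; we assume the
# prediction targets of the masked pass remain the original (unmasked) tokens.
total = mir_loss(ntp_clean=3.0, ntp_masked=4.0)
```

Because masking is applied only to an auxiliary training loss, inference uses the unmasked input path unchanged\.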

![Refer to caption](https://arxiv.org/html/2606.06888v1/x9.png)

Figure 5: Tuning the mask ratio bounds \($r_{\min}$, $r_{\max}$\) and regularization coefficient $\lambda$\.

### A\.8 Auxiliary Experimental Results

Table [7](https://arxiv.org/html/2606.06888#A1.T7) reports the best evaluation loss across model scales in the 100M unique\-token setting\. Table [8](https://arxiv.org/html/2606.06888#A1.T8) reports the Stack\-V2 validation losses\.

Table 7: Best evaluation loss across model scales in the 100M unique\-token setting \(seed 42\)\.

| Recipe | 72M | 140M | 257M | 664M | 1\.4B |
|---|---|---|---|---|---|
| Single\-epoch | 4\.866105 | 4\.960820 | 5\.025738 | 5\.019738 | 5\.302995 |
| Strongly Regularized Recipe \(Baseline\) | 3\.615903 | 3\.471395 | 3\.422107 | 3\.367138 | 3\.339578 |
| MIR | 3\.613621 | 3\.468458 | 3\.404833 | 3\.332668 | 3\.308170 |

Table 8: Validation loss on the Stack\-V2 100M unique token dataset\. MIR consistently outperforms the strongly regularized baseline across all model scales\.

| Recipe | 72M | 140M | 257M | 664M | 1\.4B |
|---|---|---|---|---|---|
| Regularized Baseline | 1\.064 | 1\.020 | 1\.005 | 0\.996 | 0\.983 |
| MIR | 1\.054 | 1\.012 | 0\.985 | 0\.988 | 0\.967 |

### A\.9 Dataset Licenses

| Dataset | Use in this paper | Version / URL | License and terms |
|---|---|---|---|
| DCLM\-Pool | Natural\-language pretraining and validation data\. | [Link](https://huggingface.co/collections/mlfoundations/dclm-pools) | CC BY 4\.0\. DCLM\-Pool is derived from Common Crawl and is also subject to the Common Crawl Terms of Use\. We cite the original DCLM paper and do not redistribute the raw dataset\. |
| The Stack v2 | Code\-heavy pretraining data for the Stack\-v2 experiments\. | [Link](https://huggingface.co/datasets/bigcode/the-stack-v2)\. Version: 2\.1\.0 | No single dataset\-wide content license; Hugging Face lists the license as “other”\. The dataset contains source code from repositories with various original licenses\. Users must comply with upstream licenses, including attribution clauses where relevant, the Stack\-v2 access terms, Software Heritage principles for language\-model training, and validated removal\-request updates\. We do not redistribute raw Stack\-v2 files\. |

Table 9: Existing datasets used in this paper and their licenses or terms of use\.

## Appendix B Details of the Scaling\-Law Analysis

This appendix gives the full scaling\-law definitions, fitting protocol, fitted constants, residual diagnostics, and the plots moved out of the main text\.

### B\.1 Setup, Notation, and Fitting Objective

We use $N$ for model size and $U$ for unique training tokens, both measured in billions\. The baseline grid contains $5$ model sizes $\{72\mathrm{M},140\mathrm{M},257\mathrm{M},664\mathrm{M},1.4\mathrm{B}\}$ and $4$ unique\-token budgets $\{100\mathrm{M},200\mathrm{M},300\mathrm{M},400\mathrm{M}\}$, for $20$ strongly regularized baseline points\. The external grid is provided in Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\]\. For our dataset and the external one, the repetition variable in the Muennighoff\-style law is $R_{D}=\mathrm{epochs}-1$, where the epoch count is taken from the best configuration or run identifier\.

For Chinchilla, Quanta, and SoftQ, we minimize the Approach\-3\-style objective of Hoffmann et al\. \[[2022](https://arxiv.org/html/2606.06888#bib.bib29)\]:

$$\min_{\theta}\sum_{i}\mathrm{Huber}_{10^{-3}}\left(\log\widehat{L}_{\theta}(N_{i},U_{i})-\log L_{i}\right). \tag{5}$$

All reported RMSE, MAE, and RSS values are computed afterward on the raw validation\-loss scale\. AIC is the SSE\-based Gaussian criterion

$$\mathrm{AIC}=n\log(\mathrm{RSS}/n)+2k, \tag{6}$$

where $n$ is the number of fitted points, $k$ is the number of fitted parameters, and constants independent of the model are omitted\. Since the fitted objective is Huber loss on log residuals, this AIC should be read as a common raw\-loss summary criterion rather than the exact likelihood optimized during fitting\.
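
To make Eqs\. \(5\) and \(6\) concrete, the sketch below evaluates the log\-Huber objective for a given set of predictions and recomputes AIC from a reported RMSE, using $\mathrm{RSS}=n\cdot\mathrm{RMSE}^{2}$\. The function names are hypothetical, and the Huber loss is written in its standard form \(quadratic within $\delta$, linear outside\), which is an assumption about the exact normalization used\.

```python
import math


def huber(r, delta=1e-3):
    """Standard Huber loss: quadratic for |r| <= delta, linear beyond."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def log_huber_objective(predicted, observed, delta=1e-3):
    """Eq. (5): sum of Huber losses on log residuals."""
    return sum(huber(math.log(p) - math.log(o), delta)
               for p, o in zip(predicted, observed))


def aic_from_rmse(rmse, n, k):
    """Eq. (6) with RSS = n * RMSE^2, both on the raw validation-loss scale."""
    rss = n * rmse ** 2
    return n * math.log(rss / n) + 2 * k


# Cross-check against the full-fit values reported in Table 10 (n = 20 points).
print(round(aic_from_rmse(0.008015, n=20, k=5), 2))  # SoftQ
print(round(aic_from_rmse(0.026528, n=20, k=5), 2))  # Chinchilla
```

With the reported RMSE values, this recovers the AIC column of Table 10 up to rounding, for example about $-183.06$ for SoftQ and $-135.18$ for Chinchilla\.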

### B\.2 Candidate Scaling Laws

The Chinchilla law is

$$L_{\mathrm{Ch}}(N,U)=E+\frac{A}{N^{\alpha}}+\frac{B}{U^{\beta}}. \tag{7}$$

Its additive form implies a data\-independent model\-size gap: $L_{\mathrm{Ch}}(N_{1},U)-L_{\mathrm{Ch}}(N_{2},U)$ does not depend on $U$\. Figure [7](https://arxiv.org/html/2606.06888#A2.F7) shows that the observed DCLM baseline gaps vary with the unique\-token budget\.
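
The data\-independence of the Chinchilla gap can be checked numerically\. The sketch below is illustrative only: it uses the fitted Chinchilla constants reported in Table 14 as defaults, and `size_gap` is a hypothetical helper\.

```python
def chinchilla(N, U, A=0.1294, alpha=0.5167, B=0.5357, beta=0.2924, E=2.1116):
    """Eq. (7), with N and U in billions; defaults are the Table 14 constants."""
    return E + A / N ** alpha + B / U ** beta


def size_gap(law, n_small, n_large, U):
    """Loss gap between a small and a large model at the same unique-token budget."""
    return law(n_small, U) - law(n_large, U)


# Gap between the 72M and 1.4B models at each unique-token budget (in billions).
gaps = [size_gap(chinchilla, 0.072, 1.439, U) for U in (0.1, 0.2, 0.3, 0.4)]
print(gaps)
```

The $B/U^{\beta}$ term cancels in the difference, so all four gaps coincide up to floating\-point error and equal $A(N_{1}^{-\alpha}-N_{2}^{-\alpha})$\.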

The Quanta\-motivated joint law is

$$L_{\mathrm{Q}}(N,U)=E+\left(\frac{A}{N}+\frac{B}{U^{1/(1+\alpha)}}\right)^{\alpha}. \tag{8}$$

It couples the parameter and data axes before applying the outer power, so the marginal value of additional parameters depends on the available data\.

The Muennighoff\-style law replaces raw resources by effective resources:

$$L_{\mathrm{M}}(N,U,R_{D})=E+\frac{A}{(N^{\prime})^{\alpha}}+\frac{B}{(D^{\prime})^{\beta}}, \tag{9}$$

$$D^{\prime}=U+UR_{D}^{\star}\left(1-\exp[-R_{D}/R_{D}^{\star}]\right), \tag{10}$$

$$N^{\prime}=U_{N}+U_{N}R_{N}^{\star}\left(1-\exp[-R_{N}/R_{N}^{\star}]\right). \tag{11}$$

Given the base Chinchilla coefficients, we compute the one\-epoch optimal parameter count

$$N_{\mathrm{opt}}(U)=\left(\frac{\alpha A}{\beta B}\right)^{1/\alpha}U^{\beta/\alpha}, \tag{12}$$

then set $U_{N}=\min\{N,N_{\mathrm{opt}}(U)\}$ and $R_{N}=N/U_{N}-1$\. The law has seven parameters to fit: $\{A,B,E,\alpha,\beta,R_{N}^{\star},R_{D}^{\star}\}$\. Our main comparison uses a dataset\-adapted two\-stage protocol: fit the base Chinchilla coefficients on the relevant split, then fit only $R_{N}^{\star}$ and $R_{D}^{\star}$\. This is the appropriate comparison if the goal is to evaluate the effective\-resource functional form on our loss scale\. The literal fixed\-C4 coefficients from Muennighoff et al\. \[[2023](https://arxiv.org/html/2606.06888#bib.bib23)\] are included as an ablation in Table [19](https://arxiv.org/html/2606.06888#A2.T19)\.
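
The effective\-resource construction in Eqs\. \(9\)–\(12\) can be sketched directly\. The function names below are hypothetical, and the example values are the first\-stage Chinchilla coefficients and the fitted $R_{N}^{\star}$, $R_{D}^{\star}$ reported in Table 14; this is an illustration of the functional form, not the fitting code\.

```python
import math


def n_opt(U, A, alpha, B, beta):
    """Eq. (12): one-epoch optimal parameter count for U unique tokens."""
    return (alpha * A / (beta * B)) ** (1 / alpha) * U ** (beta / alpha)


def effective(x, r, r_star):
    """Shared form of Eqs. (10) and (11): x + x * r_star * (1 - exp(-r / r_star))."""
    return x + x * r_star * (1 - math.exp(-r / r_star))


def muennighoff(N, U, epochs, A, alpha, B, beta, E, rn_star, rd_star):
    """Eq. (9) with R_D = epochs - 1, U_N = min(N, N_opt(U)), R_N = N / U_N - 1."""
    r_d = epochs - 1
    u_n = min(N, n_opt(U, A, alpha, B, beta))
    r_n = N / u_n - 1
    n_eff = effective(u_n, r_n, rn_star)
    d_eff = effective(U, r_d, rd_star)
    return E + A / n_eff ** alpha + B / d_eff ** beta


# Table 14 constants (N and U in billions).
params = dict(A=0.1294, alpha=0.5167, B=0.5357, beta=0.2924, E=2.1116,
              rn_star=31.39, rd_star=0.024)
loss = muennighoff(N=0.257, U=0.1, epochs=16, **params)
```

With a single epoch and $N\leq N_{\mathrm{opt}}(U)$, both repetition variables are zero and the expression reduces to the Chinchilla law; as $R_{D}\to\infty$, the effective data saturates at $(1+R_{D}^{\star})U$\.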

SoftQ is

$$L_{\mathrm{SoftQ}}(N,U)=E+\left(AN^{-\rho}+BU^{-\rho/(1+\alpha)}\right)^{\alpha/\rho}. \tag{13}$$

When $\rho=1$, this reduces to the Quanta law\. The parameter $\rho$ controls the softness of the transition between the parameter\-limited and data\-limited regimes\.
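
A short sketch of Eqs\. \(8\) and \(13\) illustrates both properties: the reduction to Quanta at $\rho=1$, and the fact that the model\-size gap under a coupled law changes with $U$\. The function names are hypothetical, and the SoftQ constants are the full\-grid values reported in Table 14\.

```python
def quanta(N, U, A, alpha, B, E):
    """Eq. (8), with N and U in billions."""
    return E + (A / N + B / U ** (1 / (1 + alpha))) ** alpha


def softq(N, U, A, alpha, B, rho, E):
    """Eq. (13), with N and U in billions."""
    inner = A * N ** -rho + B * U ** (-rho / (1 + alpha))
    return E + inner ** (alpha / rho)


# Full-grid SoftQ constants from Table 14.
fit = dict(A=39.2962, alpha=0.1425, B=92.4362, rho=0.7961, E=0.3056)

# Model-size gap (72M vs. 1.4B) at two unique-token budgets.
gap_100m = softq(0.072, 0.1, **fit) - softq(1.439, 0.1, **fit)
gap_400m = softq(0.072, 0.4, **fit) - softq(1.439, 0.4, **fit)
print(gap_100m, gap_400m)
```

Under these fitted constants, the gap between the smallest and largest models widens as the unique\-token budget grows, which the additive Chinchilla form cannot represent\.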

### B\.3 Fit Quality and Model Selection

Table 10: Full fit on all $20$ strongly regularized baseline points\. Lower is better\.

| Law | \# params | RMSE | MAE | AIC |
|---|---|---|---|---|
| Chinchilla | 5 | 0\.026528 | 0\.018016 | \-135\.18 |
| Quanta | 4 | 0\.012517 | 0\.008889 | \-167\.23 |
| Muennighoff | 7 | 0\.023345 | 0\.017130 | \-136\.29 |
| SoftQ | 5 | 0\.008015 | 0\.005204 | \-183\.06 |

Table 11: Fit on the DCLM $100$M/$200$M/$300$M baseline points and evaluation on the held\-out $400$M points\.

| Law | Train RMSE | Train MAE | Held\-out RMSE | Held\-out MAE |
|---|---|---|---|---|
| Chinchilla | 0\.024636 | 0\.016223 | 0\.031063 | 0\.025396 |
| Quanta | 0\.012430 | 0\.008853 | 0\.014975 | 0\.012073 |
| Muennighoff | 0\.023208 | 0\.016216 | 0\.032519 | 0\.027111 |
| SoftQ | 0\.008850 | 0\.005502 | 0\.005955 | 0\.004708 |

Table 12: Held\-out residuals on DCLM $400$M, predicted minus observed\.

| Law | 72M | 140M | 257M | 664M | 1\.4B |
|---|---|---|---|---|---|
| Chinchilla | \-0\.060050 | \-0\.024451 | \-0\.011210 | \+0\.017412 | \+0\.013857 |
| Quanta | \+0\.028961 | \+0\.011817 | \-0\.009568 | \-0\.004273 | \-0\.005744 |
| Muennighoff | \-0\.061890 | \-0\.025911 | \-0\.011861 | \+0\.018631 | \+0\.017262 |
| SoftQ | \-0\.002392 | \+0\.001378 | \-0\.008994 | \-0\.001468 | \-0\.009308 |

SoftQ has the best aggregate held\-out RMSE and MAE\. It is closest to zero on four of the five held\-out model sizes; Quanta is slightly closer at the 1\.4B point\.
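
The held\-out RMSE and MAE in Table 11 follow directly from the per\-model residuals in Table 12\. The sketch below recomputes them as a consistency check; the helper names are hypothetical\.

```python
import math

# Held-out residuals (predicted minus observed) on DCLM 400M, from Table 12.
residuals = {
    "Chinchilla": [-0.060050, -0.024451, -0.011210, 0.017412, 0.013857],
    "Quanta": [0.028961, 0.011817, -0.009568, -0.004273, -0.005744],
    "Muennighoff": [-0.061890, -0.025911, -0.011861, 0.018631, 0.017262],
    "SoftQ": [-0.002392, 0.001378, -0.008994, -0.001468, -0.009308],
}


def rmse(xs):
    """Root mean squared residual."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def mae(xs):
    """Mean absolute residual."""
    return sum(abs(x) for x in xs) / len(xs)


for law, res in residuals.items():
    print(f"{law:12s} RMSE={rmse(res):.6f} MAE={mae(res):.6f}")
```

The recomputed values agree with the held\-out columns of Table 11 to the reported precision\.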

Table 13: Full fit on the regularized\-baseline points provided by Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\]\.

| Law | \# params | RMSE | MAE | AIC |
|---|---|---|---|---|
| Chinchilla | 5 | 0\.040412 | 0\.025554 | \-92\.68 |
| Quanta | 4 | 0\.023750 | 0\.014726 | \-111\.69 |
| Muennighoff | 7 | 0\.032989 | 0\.022119 | \-95\.17 |
| SoftQ | 5 | 0\.007854 | 0\.005955 | \-145\.10 |

### B\.4 Fitted Constants and Selected SoftQ Law

See Tables [14](https://arxiv.org/html/2606.06888#A2.T14), [15](https://arxiv.org/html/2606.06888#A2.T15), and [16](https://arxiv.org/html/2606.06888#A2.T16) for the fitted constants of each scaling law in each scenario\. Specifically, on the full DCLM grid, the fitted SoftQ law is

$$\boxed{L_{\mathrm{SoftQ}}(N,U)=0.30565+\left(39.2962\,N^{-0.79608}+92.4362\,U^{-0.69676}\right)^{0.17906}} \tag{14}$$

with $N$ and $U$ in billions\. We therefore use Eq\. \([14](https://arxiv.org/html/2606.06888#A2.E14)\) as the regularized\-baseline law for the MIR data\-efficiency calculation\.

Table 14: Fitted constants on all $20$ DCLM baseline points\. For Muennighoff, $A,\alpha,B,\beta,E$ are the first\-stage Chinchilla coefficients\.

| Law | $A$ | $\alpha$ | $B$ | $\beta$ | $\rho$ | $E$ | Extra |
|---|---|---|---|---|---|---|---|
| Chinchilla | 0\.1294 | 0\.5167 | 0\.5357 | 0\.2924 | – | 2\.1116 | – |
| Quanta | 242\.5882 | 0\.1354 | 564\.4767 | – | – | 0\.2283 | – |
| Muennighoff | 0\.1294 | 0\.5167 | 0\.5357 | 0\.2924 | – | 2\.1116 | $R_{N}^{\star}=31.39$, $R_{D}^{\star}=0.024$ |
| SoftQ | 39\.2962 | 0\.1425 | 92\.4362 | – | 0\.7961 | 0\.3056 | – |

Table 15: Fitted constants for the held\-out extrapolation experiment, trained only on the DCLM $100$M/$200$M/$300$M points\.

| Law | $A$ | $\alpha$ | $B$ | $\beta$ | $\rho$ | $E$ | Extra |
|---|---|---|---|---|---|---|---|
| Chinchilla | 0\.1363 | 0\.4788 | 0\.9823 | 0\.1926 | – | 1\.6310 | – |
| Quanta | 799\.5772 | 0\.1263 | 1769\.1342 | – | – | $5.4\times 10^{-8}$ | – |
| Muennighoff | 0\.1363 | 0\.4788 | 0\.9823 | 0\.1926 | – | 1\.6310 | $R_{N}^{\star}=90.51$, $R_{D}^{\star}=0.008$ |
| SoftQ | 128\.7280 | 0\.1287 | 295\.7854 | – | 0\.7853 | $1.9\times 10^{-6}$ | – |

Table 16: Fitted constants on the external grid provided by Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\]\.

| Law | $A$ | $\alpha$ | $B$ | $\beta$ | $\rho$ | $E$ | Extra |
|---|---|---|---|---|---|---|---|
| Chinchilla | 0\.0543 | 1\.1551 | 0\.3659 | 0\.4594 | – | 2\.6590 | – |
| Quanta | 0\.1342 | 0\.4959 | 0\.4205 | – | – | 2\.3197 | – |
| Muennighoff | 0\.0543 | 1\.1551 | 0\.3659 | 0\.4594 | – | 2\.6590 | $R_{N}^{\star}=2.13$, $R_{D}^{\star}=0.096$ |
| SoftQ | 0\.0613 | 0\.5905 | 0\.2565 | – | 1\.4468 | 2\.4360 | – |

### B\.5 MIR Data\-Efficiency Calculation

Table 17: MIR parameter\-scaling fits used for data\-efficiency estimation\.

| MIR unique data | $A_{U}$ | $\alpha_{U}$ | MIR asymptote $E_{U}$ |
|---|---|---|---|
| 100M | 0\.03829 | 0\.82186 | 3\.27997 |
| 200M | 0\.13293 | 0\.49592 | 2\.95596 |
| 300M | 0\.13939 | 0\.51307 | 2\.83953 |
| 400M | 0\.15617 | 0\.51006 | 2\.74826 |

Using the full\-DCLM SoftQ fit, the baseline infinite\-model curve is

$$L_{\mathrm{Reg},\infty}(U)=0.30565+2.24905\,U^{-0.12476},$$

where $U$ is in billions\. Solving $L_{\mathrm{Reg},\infty}(U_{\mathrm{eq}})=E_{U}$ gives the data\-efficiency ratios reported in Table [18](https://arxiv.org/html/2606.06888#A2.T18)\.

Table 18: MIR unique\-data efficiency relative to the strongly regularized baseline, using SoftQ to model the baseline infinite\-model curve\.

| MIR unique data | MIR asymptote $E_{U}$ | Baseline\-equivalent $U_{\mathrm{eq}}$ | Data efficiency |
|---|---|---|---|
| 100M | 3\.27997 | 106\.4M | $1.06\times$ |
| 200M | 2\.95596 | 268\.2M | $1.34\times$ |
| 300M | 2\.83953 | 384\.5M | $1.28\times$ |
| 400M | 2\.74826 | 515\.9M | $1.29\times$ |

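
Because the infinite\-model curve is a single power law plus a constant, $U_{\mathrm{eq}}$ has a closed form: $U_{\mathrm{eq}}=\left((E_{U}-E)/c\right)^{-1/\gamma}$ with $c=B^{\alpha/\rho}$ and $\gamma=\alpha/(1+\alpha)$\. The sketch below applies it to the asymptotes in Table 17; the function name `u_eq` is hypothetical, and the constants are those of Eq\. \(14\)\.

```python
# Taking N -> infinity in Eq. (14) removes the A * N^(-rho) term, leaving
# L(U) = E + c * U^(-gamma), with c = B^(alpha/rho) and gamma = rho/(1+alpha) * alpha/rho.
E = 0.30565
c = 92.4362 ** 0.17906     # about 2.249
gamma = 0.69676 * 0.17906  # about 0.1248


def u_eq(e_u, E=E, c=c, gamma=gamma):
    """Unique tokens (in billions) the baseline needs to reach MIR asymptote e_u."""
    return ((e_u - E) / c) ** (-1.0 / gamma)


# MIR asymptotes E_U from Table 17, keyed by MIR unique data in billions.
mir_asymptotes = {0.1: 3.27997, 0.2: 2.95596, 0.3: 2.83953, 0.4: 2.74826}
for u_mir, e_u in mir_asymptotes.items():
    ueq = u_eq(e_u)
    print(f"U_MIR={u_mir * 1000:.0f}M  U_eq={ueq * 1000:.1f}M  ratio={ueq / u_mir:.2f}x")
```

The resulting ratios match Table 18 up to rounding of the fitted constants\.
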
### B\.6 Sensitivity Analyses

The original Muennighoff paper fixes the base Chinchilla coefficients to a C4\-calibrated law and fits only $R_{N}^{\star}$ and $R_{D}^{\star}$\. Because those base coefficients were calibrated on a different corpus and loss scale, they are not the main comparison in this paper\. Table [19](https://arxiv.org/html/2606.06888#A2.T19) reports the literal fixed\-C4 variant for completeness\.

Table 19: Literal fixed\-C4 Muennighoff variant\. This fixes the base Chinchilla law to the coefficients from Muennighoff et al\. \[[2023](https://arxiv.org/html/2606.06888#bib.bib23)\] and fits only $R_{N}^{\star}$ and $R_{D}^{\star}$\.

| Dataset / split | RMSE | MAE | AIC | Notes |
|---|---|---|---|---|
| DCLM full fit | 0\.060257 | 0\.047360 | \-108\.37 | $R_{N}^{\star}=119.82$, $R_{D}^{\star}=9.995$ |
| DCLM held\-out $400$M | 0\.071734 | 0\.064728 | – | train RMSE $=0.055268$ |
| Kim et al\. full fit | 0\.120231 | 0\.098342 | \-63\.79 | $R_{N}^{\star}=10^{8}$, $R_{D}^{\star}=0.927$ |

Table 20: MIR data efficiency under each fitted regularized\-baseline law\. We fit each baseline law on the same 20 DCLM regularized points\. $U_{\mathrm{eq}}$ is the amount of unique data the corresponding regularized\-baseline infinite\-model curve needs to match the MIR asymptote $E_{U}$\.

| MIR $U$ | Chinchilla $U_{\mathrm{eq}}$ | Chinchilla Eff\. | Quanta $U_{\mathrm{eq}}$ | Quanta Eff\. | Muennighoff $U_{\mathrm{eq}}$ | Muennighoff Eff\. | SoftQ $U_{\mathrm{eq}}$ | SoftQ Eff\. |
|---|---|---|---|---|---|---|---|---|
| 100M | 69\.5M | $0.70\times$ | 115\.0M | $1.15\times$ | 92\.4M | $0.92\times$ | 106\.4M | $1.06\times$ |
| 200M | 211\.1M | $1.06\times$ | 294\.7M | $1.47\times$ | 280\.6M | $1.40\times$ | 268\.2M | $1.34\times$ |
| 300M | 350\.6M | $1.17\times$ | 424\.9M | $1.42\times$ | 466\.1M | $1.55\times$ | 384\.5M | $1.28\times$ |
| 400M | 554\.3M | $1.39\times$ | 572\.7M | $1.43\times$ | 737\.0M | $1.84\times$ | 515\.9M | $1.29\times$ |

The main paper reports MIR data efficiency using SoftQ because it is the selected baseline law by full\-fit AIC, held\-out prediction, and the external check using data from Kim et al\. \[[2026b](https://arxiv.org/html/2606.06888#bib.bib31)\]\. For completeness, we also compute the same quantity under the Chinchilla, Quanta, and Muennighoff\-style laws fitted on the full DCLM strongly regularized baseline grid\. The Chinchilla, Quanta, and SoftQ fits use the Approach\-3 log\-Huber objective in Eq\. \([5](https://arxiv.org/html/2606.06888#A2.E5)\)\. The Muennighoff\-style fit uses the two\-stage protocol described above: fit the base Chinchilla coefficients on the same DCLM grid, then hold those coefficients fixed and fit only $R_{N}^{\star}$ and $R_{D}^{\star}$\.

For each law, we define the regularized\-baseline infinite\-model curve $L_{\mathrm{Reg},\infty}(U)$ and solve $L_{\mathrm{Reg},\infty}(U_{\mathrm{eq}})=E_{U}$, where $E_{U}$ is the MIR parameter\-scaling asymptote in Table [17](https://arxiv.org/html/2606.06888#A2.T17)\. The data efficiency ratio is $U_{\mathrm{eq}}/U_{\mathrm{MIR}}$\. For Chinchilla, Quanta, and SoftQ, the infinite\-model curves are obtained by taking $N\to\infty$\. For the Muennighoff\-style law, we take both $N\to\infty$ and the saturated repeated\-data limit $R_{D}\to\infty$, giving

$$L_{\mathrm{M},\infty}(U)=E+\frac{A}{\{(1+R_{N}^{\star})N_{\mathrm{opt}}(U)\}^{\alpha}}+\frac{B}{\{(1+R_{D}^{\star})U\}^{\beta}},$$

where

$$N_{\mathrm{opt}}(U)=\left(\frac{\alpha A}{\beta B}\right)^{1/\alpha}U^{\beta/\alpha}.$$

With the fitted constants in Table [14](https://arxiv.org/html/2606.06888#A2.T14), the resulting one\-dimensional curves are

$$\begin{aligned}
L_{\mathrm{Ch},\infty}(U)&=2.11164+0.53575\,U^{-0.29241},\\
L_{\mathrm{Q},\infty}(U)&=0.22834+2.35787\,U^{-0.11924},\\
L_{\mathrm{M},\infty}(U)&=2.11164+0.58227\,U^{-0.29241},\\
L_{\mathrm{SoftQ},\infty}(U)&=0.30565+2.24905\,U^{-0.12476},
\end{aligned}$$

where $U$ is measured in billions of unique tokens\.
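
The Muennighoff\-style curve collapses to a single power law in $U$ because $N_{\mathrm{opt}}(U)^{\alpha}\propto U^{\beta}$\. Substituting $N_{\mathrm{opt}}$ gives the coefficient $(\beta B/\alpha)(1+R_{N}^{\star})^{-\alpha}+B(1+R_{D}^{\star})^{-\beta}$\. The sketch below checks this algebra numerically with the Table 14 constants; the variable and function names are hypothetical\.

```python
# Table 14 constants for the Muennighoff-style fit (U in billions).
A, alpha, B, beta, E = 0.1294, 0.5167, 0.5357, 0.2924, 2.1116
RN_STAR, RD_STAR = 31.39, 0.024


def n_opt(U):
    """One-epoch optimal parameter count for U unique tokens."""
    return (alpha * A / (beta * B)) ** (1 / alpha) * U ** (beta / alpha)


def l_m_inf(U):
    """Infinite-model, saturated-repetition limit of the Muennighoff-style law."""
    return (E
            + A / ((1 + RN_STAR) * n_opt(U)) ** alpha
            + B / ((1 + RD_STAR) * U) ** beta)


# Collapsed form: L_M_inf(U) = E + coef * U^(-beta).
coef = (beta * B / alpha) * (1 + RN_STAR) ** -alpha + B * (1 + RD_STAR) ** -beta
print(round(coef, 5))


def u_eq_muennighoff(e_u):
    """Baseline-equivalent unique tokens (billions) under the collapsed curve."""
    return ((e_u - E) / coef) ** (-1.0 / beta)


print(u_eq_muennighoff(3.27997) * 1000)  # compare with the 100M row of Table 20
```

The collapsed coefficient agrees with the reported $0.58227$ up to rounding of the table constants, and the same inversion used for SoftQ applies to each of the four curves\.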

The alternative-law estimates vary substantially. In particular, the additive Chinchilla fit gives a sub-unity ratio at 100M because its infinite-model curve already predicts a loss below the MIR asymptote at that budget, which is another symptom of the decoupled law being misspecified in this regime. Quanta and Muennighoff generally produce larger ratios than SoftQ at 200M–400M, while SoftQ gives the most conservative estimates among the coupled laws that also passed the held-out and external-data checks. For this reason, we keep the SoftQ-based ratios as the main-text estimate and report the other laws only as sensitivity analyses. We also observe that the spread in data efficiency ratios across scaling laws generally shrinks as the unique data size increases.

### B.7 Additional Visualizations

Figure [6](https://arxiv.org/html/2606.06888#A2.F6) gives the absolute-loss view of the Chinchilla and SoftQ fits. Figures [7](https://arxiv.org/html/2606.06888#A2.F7) and [8](https://arxiv.org/html/2606.06888#A2.F8) provide additional views of the baseline and MIR scaling results.

![Refer to caption](https://arxiv.org/html/2606.06888v1/x10.png)
![Refer to caption](https://arxiv.org/html/2606.06888v1/x11.png)

Figure 6: Absolute-loss view of the fitted Chinchilla and SoftQ laws on the 20 strongly regularized baseline points. Left: Chinchilla fit. Right: SoftQ fit. Points are observed validation losses and curves are model predictions.

![Refer to caption](https://arxiv.org/html/2606.06888v1/x12.png)

Figure 7: Regularized baseline validation loss as a function of unique training data size $U$ across five model sizes. The changing separation between curves contradicts the data-independent model-size gap implied by the additive Chinchilla form.

![Refer to caption](https://arxiv.org/html/2606.06888v1/x13.png)

Figure 8: Scaling curves across four unique-data budgets for the strongly regularized baseline and MIR. MIR improves validation loss at most model-size and data-budget pairs, and the asymptotic fits in Table [17](https://arxiv.org/html/2606.06888#A2.T17) quantify the infinite-model limit.

## Appendix C Why Masking Reduces Memorization: A Toy Model

This section gives a toy model for the intuition stated in Section [3.3](https://arxiv.org/html/2606.06888#S3.SS3) of the main text: masked-input regularization reduces validation loss by reducing dependence on context-specific components (noise) while preserving a signal through generalizable components. The intention in this section is not to model transformer pretraining in full, but to isolate one mechanism that becomes important in the data-constrained, compute-rich regime.

We decompose each training sequence into three parts: a context\-specific component, a generalizable component, and an output token\. The context\-specific component can identify individual training examples and therefore enables memorization\. The generalizable component contains predictive information that also appears in validation examples\. A sufficiently large model can fit the finite training set through the context\-specific component alone; however, this fit does not transfer to validation examples with unseen context\-specific components\. Masking changes this because it can sometimes hide the context\-specific component while leaving the generalizable component visible\. On such masked inputs, prediction through memorization is unavailable, so the model is encouraged to learn predictive patterns from the generalizable component\. Specifically, we introduce the following context\-specific noise model\.

###### Definition C.1 (Context-Specific Noise Model).

The training set consists of examples

$$(C_{i},S_{i},Y_{i}),\quad i=1,\ldots,n,$$

where $C_{i}$ is a context-specific component, $S_{i}\in\mathbb{R}^{d}$ is a generalizable component, and $Y_{i}\in\{-1,+1\}$ is the output token to be predicted. In this model, we consider binary prediction for simplicity and clarity. We assume

$$\|S_{i}\|_{2}\leq B.$$

The population validation distribution has the same joint distribution of $(S,Y)$, but its context-specific components are unseen during training.

Let

$$\mu:=\mathbb{E}[YS]\in\mathbb{R}^{d},\quad\Sigma:=\mathbb{E}[SS^{\top}]\in\mathbb{R}^{d\times d},$$

and assume $\mu\neq 0$. On the finite training set, define

$$\widehat{\mu}:=\frac{1}{n}\sum_{i=1}^{n}Y_{i}S_{i},\quad\widehat{\Sigma}:=\frac{1}{n}\sum_{i=1}^{n}S_{i}S_{i}^{\top}.$$

Let $\mathbf{S}\in\mathbb{R}^{n\times d}$ be the matrix with $i$-th row $S_{i}^{\top}$, and let $Y=(Y_{1},\ldots,Y_{n})^{\top}$.

###### Definition C.2 (Clean and MIR Objectives).

For model size $m$, let

$$\phi_{i}=\phi_{m}(C_{i})\in\mathbb{R}^{m},\quad\Phi_{m}=(\phi_{1},\ldots,\phi_{n})^{\top}\in\mathbb{R}^{n\times m},\quad G_{m}=\Phi_{m}\Phi_{m}^{\top}.$$

The model prediction score on example $i$ is

$$\theta_{i}=\phi_{i}^{\top}w+b^{\top}S_{i},$$

where $w\in\mathbb{R}^{m}$ models the context-specific memorization component and $b\in\mathbb{R}^{d}$ models the generalizable component. We consider squared and logistic losses,

$$\ell_{\rm sq}(y,\theta)=\frac{1}{2}(y-\theta)^{2},\quad\ell_{\rm log}(y,\theta)=\log(1+\exp(-y\theta)).$$

To model masking, let $r\in[0,1]$ be a sampled mask ratio. Conditional on $r$, let $V_{C,i},V_{S,i}\in\{0,1\}$ be independent visibility indicators with

$$\mathbb{P}(V_{C,i}=1\mid r)=\mathbb{P}(V_{S,i}=1\mid r)=1-r.$$

The masked prediction score on example $i$ is then

$$V_{C,i}\phi_{i}^{\top}w+V_{S,i}b^{\top}S_{i}.$$

Thus masking may remove the context-specific component, the generalizable component, both, or neither. For loss $\ell\in\{\ell_{\rm sq},\ell_{\rm log}\}$, the clean objective is

$$\widehat{J}_{\ell,{\rm clean}}^{(m)}(w,b)=\frac{1}{n}\sum_{i=1}^{n}\ell\!\left(Y_{i},\phi_{i}^{\top}w+b^{\top}S_{i}\right)+\frac{\rho_{w}}{2n}\|w\|_{2}^{2}+\frac{\rho_{b}}{2}\|b\|_{2}^{2},$$

and the MIR objective is

$$\widehat{J}_{\ell,{\rm MIR}}^{(m)}(w,b)=\widehat{J}_{\ell,{\rm clean}}^{(m)}(w,b)+\frac{\lambda}{n}\sum_{i=1}^{n}\mathbb{E}_{M}\left[\ell\!\left(Y_{i},V_{C,i}\phi_{i}^{\top}w+V_{S,i}b^{\top}S_{i}\right)\right],$$

where the expectation is over the masking randomness.
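The masking expectation in the MIR objective is a sum over four visibility cases per example. The Python sketch below enumerates those cases for one example, assuming the two scores $\phi_{i}^{\top}w$ and $b^{\top}S_{i}$ have already been computed. All function and argument names are hypothetical, and the outer expectation over $r$ is approximated by averaging over a user-supplied list of mask ratios.

```python
import math
from itertools import product


def sq_loss(y, theta):
    """Squared loss 0.5 * (y - theta)^2."""
    return 0.5 * (y - theta) ** 2


def log_loss(y, theta):
    """Logistic loss log(1 + exp(-y * theta))."""
    return math.log1p(math.exp(-y * theta))


def expected_masked_loss(loss, y, ctx_score, gen_score, r):
    """E over (V_C, V_S) of loss(y, V_C * ctx_score + V_S * gen_score) for fixed r.

    V_C and V_S are independent and each equals 1 with probability 1 - r.
    """
    total = 0.0
    for v_c, v_s in product((0, 1), repeat=2):
        p_c = (1 - r) if v_c else r
        p_s = (1 - r) if v_s else r
        total += p_c * p_s * loss(y, v_c * ctx_score + v_s * gen_score)
    return total


def expected_masked_loss_over_r(loss, y, ctx_score, gen_score, r_values):
    """Average the conditional expectation over equally weighted mask ratios."""
    return sum(
        expected_masked_loss(loss, y, ctx_score, gen_score, r) for r in r_values
    ) / len(r_values)
```

The case with the context-specific score hidden and the generalizable score visible has weight $r(1-r)$; this is the case that can only be fit through $b$.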

###### Assumption C.3 (Growing Context-Specific Capacity).

For every $m$, $G_{m}\succ 0$. Let $a_{m}:=\lambda_{\min}(G_{m})$; then $a_{m}\to\infty$ as $m\to\infty$.

Assumption [C.3](https://arxiv.org/html/2606.06888#A3.Thmtheorem3) captures the data-constrained, compute-rich regime: the number of training examples is fixed, while the capacity of the context-specific memorization component grows. In this regime, for any fixed vector of prediction scores on the finite training set, the context-specific component can represent that vector with vanishing regularization cost as $m\to\infty$.

###### Theorem C.4 (Behavior of the generalizable component in Clean and MIR training).

Let

$$h:=\mathbb{E}[(1-r)^{2}],\quad q:=\mathbb{E}[r(1-r)],\quad\beta:=\lambda q,$$

and assume $\beta>0$. Define

$$\alpha:=1+\lambda h,\quad\delta:=\alpha+\beta,\quad\eta:=\delta-\frac{\alpha^{2}}{\delta}=\frac{\beta(\delta+\alpha)}{\delta}.$$

Under Assumption [C.3](https://arxiv.org/html/2606.06888#A3.Thmtheorem3), let $b_{{\rm clean},{\rm sq}}^{(m)}$, $b_{{\rm MIR},{\rm sq}}^{(m)}$, $b_{{\rm clean},{\rm log}}^{(m)}$, and $b_{{\rm MIR},{\rm log}}^{(m)}$ denote the $b$-coordinates of minimizers of the corresponding objectives. Then, as $m\to\infty$,

$$b_{{\rm clean},{\rm sq}}^{(m)}\to 0,\quad b_{{\rm MIR},{\rm sq}}^{(m)}\to\bar{b}_{\rm sq}:=\beta\left(\rho_{b}I_{d}+\eta\widehat{\Sigma}\right)^{-1}\widehat{\mu}.$$

For logistic loss,

$$b_{{\rm clean},{\rm log}}^{(m)}\to 0,\quad b_{{\rm MIR},{\rm log}}^{(m)}\to\bar{b}_{\rm log},$$

where $\bar{b}_{\rm log}$ is the unique minimizer of

$$b\mapsto\beta\,\frac{1}{n}\sum_{i=1}^{n}\log\!\left(1+\exp(-Y_{i}b^{\top}S_{i})\right)+\frac{\rho_{b}}{2}\|b\|_{2}^{2}.$$

Moreover, if $\widehat{\mu}\neq 0$, then $\widehat{\mu}^{\top}\bar{b}_{\rm sq}>0$ and $\bar{b}_{\rm sq}\neq 0$. For logistic loss, if $\widehat{\mu}\neq 0$, then

$$\bar{b}_{\rm log}\neq 0,\quad\|\bar{b}_{\rm log}\|_{2}\leq\frac{\beta B}{\rho_{b}}.$$

The theorem formalizes the memorization effect\. Clean training can fit the finite training set through the context\-specific component alone, so the coefficient on the generalizable component vanishes as context\-specific capacity grows\. MIR does not have this degeneracy: because masking sometimes hides the context\-specific component, the limiting objective retains a nonzero training signal for the generalizable component\.
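To make the constants in Theorem C.4 concrete, the sketch below computes $h$, $q$, $\alpha$, $\beta$, $\delta$, and $\eta$ from the first two moments of the mask ratio. The choice $r\sim\mathrm{Uniform}[0,1]$ with $\lambda=1$ is an assumed example, not a setting taken from the paper, and `mir_constants` is a hypothetical helper name. Exact fractions are used so that the two expressions for $\eta$ can be compared without rounding.

```python
from fractions import Fraction


def mir_constants(lam, mean_r, mean_r2):
    """Return (h, q, s0, alpha, beta, delta, eta) from E[r] and E[r^2]."""
    h = 1 - 2 * mean_r + mean_r2   # E[(1 - r)^2]: both components visible
    q = mean_r - mean_r2           # E[r (1 - r)]: exactly one component hidden
    s0 = mean_r2                   # E[r^2]: both components hidden
    alpha = 1 + lam * h
    beta = lam * q
    delta = alpha + beta
    eta = delta - alpha ** 2 / delta
    return h, q, s0, alpha, beta, delta, eta


# Assumed example: r ~ Uniform[0, 1], so E[r] = 1/2 and E[r^2] = 1/3; lambda = 1.
h, q, s0, alpha, beta, delta, eta = mir_constants(
    Fraction(1), Fraction(1, 2), Fraction(1, 3)
)
```

In this example the four case probabilities $h$, $q$, $q$, and $\mathbb{E}[r^{2}]$ sum to one, and $\eta$ agrees with $\beta(\delta+\alpha)/\delta$.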

###### Assumption C.5 (Validation Contexts Are Unseen).

For validation examples, the context-specific memorization features learned on the training set are unavailable. We model this as

$$\phi_{m}(C_{\rm val})=0.$$

Therefore validation predictions depend only on the generalizable logit $b^{\top}S$.

This assumption does not say that validation text contains no patterns related to the training text\. It says only that the example\-specific context features used to memorize the finite training corpus do not transfer to unseen validation examples\.

###### Theorem C.6 (MIR Improves Validation Risk).

Under the assumptions of Theorem [C.4](https://arxiv.org/html/2606.06888#A3.Thmtheorem4) and Assumption [C.5](https://arxiv.org/html/2606.06888#A3.Thmtheorem5), define

$$R_{\rm sq}(b):=\mathbb{E}\!\left[(Y-b^{\top}S)^{2}\right]=1-2\mu^{\top}b+b^{\top}\Sigma b,$$

and

$$R_{\rm log}(b):=\mathbb{E}\!\left[\log\!\left(1+\exp(-Yb^{\top}S)\right)\right].$$

For squared loss, if $2\mu^{\top}\bar{b}_{\rm sq}-\bar{b}_{\rm sq}^{\top}\Sigma\bar{b}_{\rm sq}>0$, then, for all sufficiently large $m$,

$$R_{\rm sq}\!\left(b_{{\rm MIR},{\rm sq}}^{(m)}\right)<R_{\rm sq}\!\left(b_{{\rm clean},{\rm sq}}^{(m)}\right).$$

This condition holds automatically when $\widehat{\mu}=\mu$ and $\widehat{\Sigma}=\Sigma$. For logistic loss, if $\mu^{\top}\bar{b}_{\rm log}>\frac{B^{2}}{4}\|\bar{b}_{\rm log}\|_{2}^{2}$, then, for all sufficiently large $m$,

$$R_{\rm log}\!\left(b_{{\rm MIR},{\rm log}}^{(m)}\right)<R_{\rm log}\!\left(b_{{\rm clean},{\rm log}}^{(m)}\right).$$

In particular, this logistic condition holds for sufficiently small $\beta/\rho_{b}$ whenever $\mu^{\top}\widehat{\mu}>0$. Moreover, defining

$$\Delta_{{\rm sq},m}:=R_{\rm sq}\!\left(b_{{\rm clean},{\rm sq}}^{(m)}\right)-R_{\rm sq}\!\left(b_{{\rm MIR},{\rm sq}}^{(m)}\right),$$

and

$$\Delta_{{\rm log},m}:=R_{\rm log}\!\left(b_{{\rm clean},{\rm log}}^{(m)}\right)-R_{\rm log}\!\left(b_{{\rm MIR},{\rm log}}^{(m)}\right),$$

we have

$$\Delta_{{\rm sq},m}\to R_{\rm sq}(0)-R_{\rm sq}(\bar{b}_{\rm sq})>0$$

under the squared-loss condition, and

$$\Delta_{{\rm log},m}\to R_{\rm log}(0)-R_{\rm log}(\bar{b}_{\rm log})>0$$

under the logistic-loss condition.

###### Corollary C.7 (Empirical Signal).

Suppose $(S_{i},Y_{i})$ are independent, $\|S_{i}\|_{2}\leq B$, and $\mu=\mathbb{E}[YS]\neq 0$. Then

$$\mathbb{P}\!\left(\mu^{\top}\widehat{\mu}>0\right)\geq 1-\exp\!\left(-\frac{n\|\mu\|_{2}^{2}}{2B^{2}}\right).$$

In particular, with high probability, the empirical generalizable component is aligned with the population predictive direction.
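The bound in Corollary C.7 is easy to evaluate numerically. The sketch below computes the lower bound $1-\exp(-n\|\mu\|_{2}^{2}/(2B^{2}))$ and inverts it to find a sample size sufficient for a given failure probability. The function names and the example inputs are hypothetical.

```python
import math


def alignment_probability_lower_bound(n, mu_norm, B):
    """Lower bound 1 - exp(-n * ||mu||^2 / (2 B^2)) on P(mu^T mu_hat > 0)."""
    return 1.0 - math.exp(-n * mu_norm ** 2 / (2.0 * B ** 2))


def samples_for_confidence(mu_norm, B, failure_prob):
    """Smallest n for which the bound guarantees failure probability <= failure_prob."""
    return math.ceil(2.0 * B ** 2 * math.log(1.0 / failure_prob) / mu_norm ** 2)


# Hypothetical inputs: signal norm 0.1, feature bound B = 1, 1% failure probability.
n_needed = samples_for_confidence(0.1, 1.0, 0.01)
bound_at_n = alignment_probability_lower_bound(n_needed, 0.1, 1.0)
```

The required sample size scales as $B^{2}/\|\mu\|_{2}^{2}$, so a weaker generalizable signal needs quadratically more examples before the bound guarantees alignment.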

The previous results compare clean and MIR training in the limit\. The next result makes the dependence on model size explicit for a simplified squared\-loss objective\. This simplified objective replaces the full expected masked loss by the term that appears when the context\-specific component is hidden while the generalizable component remains visible\. It is not the full MIR objective, but it isolates the part of masking that forces prediction from the generalizable component\.

###### Theorem C.8 (Increasing Benefit with Growing Model Size).

Consider the squared-loss objective

$$\widehat{J}_{{\rm key},{\rm sq}}^{(m)}(w,b):=\widehat{J}_{\ell_{\rm sq},{\rm clean}}^{(m)}(w,b)+\frac{\beta}{2n}\|Y-\mathbf{S}b\|_{2}^{2},\qquad\beta>0.$$

Let $b_{{\rm key},{\rm sq}}^{(m)}$ be its $b$-coordinate minimizer, and define

$$\Delta_{{\rm key},m}:=R_{\rm sq}\!\left(b_{{\rm clean},{\rm sq}}^{(m)}\right)-R_{\rm sq}\!\left(b_{{\rm key},{\rm sq}}^{(m)}\right).$$

Assume $G_{m}=mI_{n}$, $\widehat{\mu}=\mu$, $\widehat{\Sigma}=\Sigma$, and $\mu\neq 0$. Then $\Delta_{{\rm key},m}$ is strictly increasing in $m$. Moreover, let

$$\Sigma=U\,{\rm diag}(\lambda_{1},\ldots,\lambda_{d})\,U^{\top},\quad U^{\top}\mu=(\mu_{1},\ldots,\mu_{d})^{\top}.$$

For each $\lambda_{j}>0$, define

$$\kappa_{j}:=\frac{\beta\lambda_{j}}{\rho_{b}}.$$

Then

$$\lim_{m\to\infty}\Delta_{{\rm key},m}=\sum_{\lambda_{j}>0}\frac{\mu_{j}^{2}}{\lambda_{j}}\frac{\kappa_{j}(\kappa_{j}+2)}{(1+\kappa_{j})^{2}}>0.$$

Theorem [C.8](https://arxiv.org/html/2606.06888#A3.Thmtheorem8) illustrates the increasing benefit of masking the context-specific component as model size increases. The condition $G_{m}=mI_{n}$ is an idealized isotropic-capacity assumption and is stronger than needed for the main intuition; it is used here to obtain a simple closed-form expression and a monotonicity statement. More general Gram matrices with growing eigenvalues would lead to a similar conclusion, although the closed-form expression would be less transparent. The assumptions $\widehat{\mu}=\mu$ and $\widehat{\Sigma}=\Sigma$ remove the finite-sample realization error of the training sequences from the statement. They ensure that the empirical generalizable signal in the training set is aligned with the population signal that determines validation risk. Under these conditions, any difference between clean training and the masked objective comes from the use of context-specific memorization. In finite samples, these assumptions can be interpreted as a population-aligned simplification: when $n$ is large, $\widehat{\mu}$ and $\widehat{\Sigma}$ concentrate around $\mu$ and $\Sigma$, so the same conclusion is stable up to small perturbation terms. In what follows, we prove the theoretical results in this section.
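The monotonicity and the limit in Theorem C.8 can be checked numerically in the eigenbasis of $\Sigma$. The sketch below is not the paper's proof. It assumes $G_{m}=mI_{n}$, $\widehat{\mu}=\mu$, and $\widehat{\Sigma}=\Sigma$, under which the profiled clean objective from the proof of Theorem C.4 reduces to a ridge problem with scalar weight $t_{m}=\rho_{w}/(m+\rho_{w})$, giving coordinates $b_{j}=c\,\mu_{j}/(\rho_{b}+c\,\lambda_{j})$ with $c=t_{m}$ for clean training and $c=t_{m}+\beta$ for the key objective. That reduction is our own working, and all names and numeric values are hypothetical.

```python
def t_m(m, rho_w):
    """Eigenvalue of T_m = rho_w (G_m + rho_w I)^{-1} when G_m = m I_n."""
    return rho_w / (m + rho_w)


def risk_reduction(c, lams, mus, rho_b):
    """R_sq(0) - R_sq(b(c)) for b_j(c) = c mu_j / (rho_b + c lam_j)."""
    total = 0.0
    for lam, mu in zip(lams, mus):
        b = c * mu / (rho_b + c * lam)
        total += 2.0 * mu * b - lam * b * b
    return total


def delta_key(m, lams, mus, beta, rho_w, rho_b):
    """Delta_key,m = R_sq(b_clean) - R_sq(b_key) under the stated assumptions."""
    t = t_m(m, rho_w)
    return risk_reduction(t + beta, lams, mus, rho_b) - risk_reduction(
        t, lams, mus, rho_b
    )


def delta_key_limit(lams, mus, beta, rho_b):
    """Closed-form limit from the theorem, summing over lam_j > 0."""
    total = 0.0
    for lam, mu in zip(lams, mus):
        if lam > 0:
            kappa = beta * lam / rho_b
            total += (mu ** 2 / lam) * kappa * (kappa + 2.0) / (1.0 + kappa) ** 2
    return total


# Hypothetical spectrum and signal coordinates in the eigenbasis of Sigma.
lams, mus = [2.0, 0.5], [0.6, 0.3]
beta, rho_w, rho_b = 0.25, 1.0, 0.1
gaps = [delta_key(m, lams, mus, beta, rho_w, rho_b) for m in (1, 10, 100, 1000, 10 ** 6)]
limit = delta_key_limit(lams, mus, beta, rho_b)
```

The gaps grow with $m$ because the risk reduction is concave in the weight $c$: as $t_{m}$ shrinks, the clean solution loses more signal than the key solution does.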

Throughout the proofs, we write

$$u:=\Phi_{m}w\in\mathbb{R}^{n}.$$

Whenever $G_{m}=\Phi_{m}\Phi_{m}^{\top}\succ 0$, every $u\in\mathbb{R}^{n}$ is representable as $\Phi_{m}w$, and the minimum-norm representative satisfies

$$\min_{w:\Phi_{m}w=u}\|w\|_{2}^{2}=u^{\top}G_{m}^{-1}u.$$

Indeed, $w_{\star}=\Phi_{m}^{\top}G_{m}^{-1}u$ satisfies $\Phi_{m}w_{\star}=u$. For any other feasible $w=w_{\star}+v$, we have $\Phi_{m}v=0$, and hence

$$\langle w_{\star},v\rangle=u^{\top}G_{m}^{-1}\Phi_{m}v=0.$$

Thus

$$\|w\|_{2}^{2}=\|w_{\star}\|_{2}^{2}+\|v\|_{2}^{2}\geq\|w_{\star}\|_{2}^{2}=u^{\top}G_{m}^{-1}u.$$

Therefore the optimization over $(w,b)$ is equivalent to optimization over $(u,b)$, with regularization term

$$\frac{\rho_{w}}{2n}u^{\top}G_{m}^{-1}u.$$
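The minimum-norm identity can be checked numerically. The sketch below uses NumPy with a random feature matrix, which is an assumption made purely for illustration: it builds $w_{\star}=\Phi_{m}^{\top}G_{m}^{-1}u$, confirms feasibility and the cost $u^{\top}G_{m}^{-1}u$, and compares against another feasible point obtained by adding a null-space direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 10                        # n examples, m context-specific features (m >= n)
Phi = rng.standard_normal((n, m))   # rows are hypothetical features phi_i
G = Phi @ Phi.T                     # Gram matrix G_m, positive definite almost surely
u = rng.standard_normal(n)          # arbitrary target score vector

# Minimum-norm representative w_star = Phi^T G^{-1} u and its squared norm.
w_star = Phi.T @ np.linalg.solve(G, u)
min_cost = u @ np.linalg.solve(G, u)

# Any other feasible w differs from w_star by a direction v in the null space of Phi.
z = rng.standard_normal(m)
v = z - Phi.T @ np.linalg.solve(G, Phi @ z)   # project z onto null(Phi)
w_other = w_star + v
```

Because $w_{\star}$ lies in the row space of $\Phi_{m}$ and $v$ in its null space, the two are orthogonal, which is the Pythagorean step used in the argument above.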
### C.1 Proof of Theorem [C.4](https://arxiv.org/html/2606.06888#A3.Thmtheorem4)

###### Proof.

We first prove the squared-loss claims. In $(u,b)$-coordinates, the clean squared-loss objective is

$$\frac{1}{2n}\|Y-u-\mathbf{S}b\|_{2}^{2}+\frac{\rho_{w}}{2n}u^{\top}G_{m}^{-1}u+\frac{\rho_{b}}{2}\|b\|_{2}^{2}.$$

For fixed $b$, the first-order condition in $u$ is

$$u-(Y-\mathbf{S}b)+\rho_{w}G_{m}^{-1}u=0.$$

Therefore

$$u^{\star}(b)=(G_{m}+\rho_{w}I_{n})^{-1}G_{m}(Y-\mathbf{S}b).$$

Profiling out $u$, the clean squared-loss objective becomes

$$\frac{1}{2n}(Y-\mathbf{S}b)^{\top}T_{m}(Y-\mathbf{S}b)+\frac{\rho_{b}}{2}\|b\|_{2}^{2},\quad T_{m}:=\rho_{w}(G_{m}+\rho_{w}I_{n})^{-1}.$$

Differentiating with respect to $b$ gives

$$b_{{\rm clean},{\rm sq}}^{(m)}=\left(\rho_{b}I_{d}+\frac{1}{n}\mathbf{S}^{\top}T_{m}\mathbf{S}\right)^{-1}\frac{1}{n}\mathbf{S}^{\top}T_{m}Y.$$

The eigenvalues of $T_{m}$ are

$$\frac{\rho_{w}}{\lambda_{j}(G_{m})+\rho_{w}},$$

and hence

$$\|T_{m}\|_{\rm op}\leq\frac{\rho_{w}}{a_{m}+\rho_{w}}\to 0.$$

It follows that

$$b_{{\rm clean},{\rm sq}}^{(m)}\to 0.$$
We next consider MIR with squared loss. Let

$$s_{0}:=\mathbb{E}[r^{2}].$$

Up to the additive constant $\lambda s_{0}(2n)^{-1}\|Y\|_{2}^{2}$, the expected four-case squared-loss objective is

$$\begin{aligned}
&\alpha\frac{1}{2n}\|Y-u-\mathbf{S}b\|_{2}^{2}+\beta\frac{1}{2n}\|Y-\mathbf{S}b\|_{2}^{2}+\beta\frac{1}{2n}\|Y-u\|_{2}^{2}\\
&\qquad+\frac{\rho_{w}}{2n}u^{\top}G_{m}^{-1}u+\frac{\rho_{b}}{2}\|b\|_{2}^{2}.
\end{aligned}$$

The first-order condition in $u$ is

$$\alpha(u+\mathbf{S}b-Y)+\beta(u-Y)+\rho_{w}G_{m}^{-1}u=0.$$

Since $\delta=\alpha+\beta$, this gives

$$u^{\star}(b)=M_{m}(\delta Y-\alpha\mathbf{S}b),\quad M_{m}:=(\delta I_{n}+\rho_{w}G_{m}^{-1})^{-1}.$$

The first-order condition in $b$ is

$$\frac{\alpha}{n}\mathbf{S}^{\top}(u+\mathbf{S}b-Y)+\frac{\beta}{n}\mathbf{S}^{\top}(\mathbf{S}b-Y)+\rho_{b}b=0.$$

Equivalently,

$$\frac{\alpha}{n}\mathbf{S}^{\top}u+(\delta\widehat{\Sigma}+\rho_{b}I_{d})b-\delta\widehat{\mu}=0.$$

Substituting $u^{\star}(b)=M_{m}(\delta Y-\alpha\mathbf{S}b)$ yields

$$b_{{\rm MIR},{\rm sq}}^{(m)}=\left(\rho_{b}I_{d}+\delta\widehat{\Sigma}-\frac{\alpha^{2}}{n}\mathbf{S}^{\top}M_{m}\mathbf{S}\right)^{-1}\left(\delta\widehat{\mu}-\frac{\alpha\delta}{n}\mathbf{S}^{\top}M_{m}Y\right).$$

Because $\|G_{m}^{-1}\|_{\rm op}\to 0$, we have

$$M_{m}\to\delta^{-1}I_{n}$$

in operator norm. Therefore

$$\frac{1}{n}\mathbf{S}^{\top}M_{m}Y\to\frac{1}{\delta}\widehat{\mu},\quad\frac{1}{n}\mathbf{S}^{\top}M_{m}\mathbf{S}\to\frac{1}{\delta}\widehat{\Sigma}.$$

It follows that

$$b_{{\rm MIR},{\rm sq}}^{(m)}\to\beta(\rho_{b}I_{d}+\eta\widehat{\Sigma})^{-1}\widehat{\mu}=\bar{b}_{\rm sq}.$$

If $\widehat{\mu}\neq 0$, then

$$\widehat{\mu}^{\top}\bar{b}_{\rm sq}=\beta\widehat{\mu}^{\top}(\rho_{b}I_{d}+\eta\widehat{\Sigma})^{-1}\widehat{\mu}>0,$$

because $\rho_{b}I_{d}+\eta\widehat{\Sigma}\succ 0$. Hence $\bar{b}_{\rm sq}\neq 0$.
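The two closed-form minimizers derived above can be evaluated directly to watch the limits emerge. The NumPy sketch below uses synthetic data, the idealized Gram matrix $G_{m}=mI_{n}$, and the moments of $r\sim\mathrm{Uniform}[0,1]$; these choices are assumptions for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
S = rng.standard_normal((n, d))      # synthetic generalizable components
Y = np.sign(S @ np.array([1.0, -0.5, 0.2]) + 0.3 * rng.standard_normal(n))
Y[Y == 0] = 1.0                      # guard against an exact zero score
rho_w, rho_b, lam = 1.0, 0.1, 1.0
h, q = 1.0 / 3.0, 1.0 / 6.0          # E[(1-r)^2], E[r(1-r)] for r ~ Uniform[0, 1] (assumed)
alpha, beta = 1.0 + lam * h, lam * q
delta = alpha + beta
eta = delta - alpha ** 2 / delta
mu_hat, Sigma_hat = S.T @ Y / n, S.T @ S / n
I_n, I_d = np.eye(n), np.eye(d)


def b_clean(G):
    """Clean squared-loss minimizer from the profiled objective."""
    T = rho_w * np.linalg.inv(G + rho_w * I_n)
    return np.linalg.solve(rho_b * I_d + S.T @ T @ S / n, S.T @ T @ Y / n)


def b_mir(G):
    """MIR squared-loss minimizer from the two first-order conditions."""
    M = np.linalg.inv(delta * I_n + rho_w * np.linalg.inv(G))
    lhs = rho_b * I_d + delta * Sigma_hat - alpha ** 2 * S.T @ M @ S / n
    rhs = delta * mu_hat - alpha * delta * S.T @ M @ Y / n
    return np.linalg.solve(lhs, rhs)


# Limiting coefficient b_bar_sq from Theorem C.4.
b_bar = beta * np.linalg.solve(rho_b * I_d + eta * Sigma_hat, mu_hat)
for m in (1e0, 1e2, 1e4, 1e6):
    G = m * I_n                      # idealized isotropic Gram matrix
    print(m, np.linalg.norm(b_clean(G)), np.linalg.norm(b_mir(G) - b_bar))
```

Both printed norms should shrink as $m$ grows: the clean coefficient collapses to zero while the MIR coefficient settles at $\bar{b}_{\rm sq}$.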

We now prove the logistic-loss claims. Let

$$g(z):=\log(1+e^{-z}).$$

Choose $\tau_{m}\to\infty$ such that $\tau_{m}^{2}/a_{m}\to 0$. For clean logistic training, evaluate the objective at $u=\tau_{m}Y$ and $b=0$. Then

$$g(Y_{i}u_{i})=g(\tau_{m})\to 0,$$

and

$$\frac{\rho_{w}}{2n}u^{\top}G_{m}^{-1}u\leq\frac{\rho_{w}}{2}\frac{\tau_{m}^{2}}{a_{m}}\to 0.$$

Hence the minimum clean logistic objective converges to zero. Since the objective is nonnegative and contains the term $\rho_{b}\|b\|_{2}^{2}/2$, every clean logistic minimizer satisfies

$$b_{{\rm clean},{\rm log}}^{(m)}\to 0.$$
For MIR logistic training, define

$$F_{\rm log}(b):=\beta\frac{1}{n}\sum_{i=1}^{n}g(Y_{i}b^{\top}S_{i})+\frac{\rho_{b}}{2}\|b\|_{2}^{2}.$$

This function is strongly convex and therefore has a unique minimizer $\bar{b}_{\rm log}$. The expected four-case MIR logistic objective in $(u,b)$-coordinates is

$$\begin{aligned}
\widehat{J}_{\ell_{\rm log},{\rm MIR}}^{(m)}(u,b):={}&\alpha\frac{1}{n}\sum_{i=1}^{n}g(Y_{i}(u_{i}+b^{\top}S_{i}))+\beta\frac{1}{n}\sum_{i=1}^{n}g(Y_{i}b^{\top}S_{i})+\beta\frac{1}{n}\sum_{i=1}^{n}g(Y_{i}u_{i})\\
&+\lambda s_{0}\log 2+\frac{\rho_{w}}{2n}u^{\top}G_{m}^{-1}u+\frac{\rho_{b}}{2}\|b\|_{2}^{2}.
\end{aligned}$$

All terms except $F_{\rm log}(b)$ and the constant $\lambda s_{0}\log 2$ are nonnegative. Hence, for every $(u,b)$,

$$\widehat{J}_{\ell_{\rm log},{\rm MIR}}^{(m)}(u,b)\geq F_{\rm log}(b)+\lambda s_{0}\log 2.$$

Now evaluate the MIR logistic objective at $b=\bar{b}_{\rm log}$ and $u=\tau_{m}Y$. Then the context-specific-only margin is $Y_{i}u_{i}=\tau_{m}$, while the margin satisfies

$$Y_{i}(u_{i}+\bar{b}_{\rm log}^{\top}S_{i})=\tau_{m}+Y_{i}\bar{b}_{\rm log}^{\top}S_{i}\geq\tau_{m}-B\|\bar{b}_{\rm log}\|_{2}\to\infty.$$

Thus the corresponding logistic losses vanish, and the context-specific regularization again tends to zero as we choose $\tau_{m}$ such that $\tau_{m}^{2}/a_{m}\to 0$. Therefore

$$\inf_{u,b}\widehat{J}_{\ell_{\rm log},{\rm MIR}}^{(m)}(u,b)\leq F_{\rm log}(\bar{b}_{\rm log})+\lambda s_{0}\log 2+o(1).$$

Combining the lower and upper bounds gives

$$F_{\rm log}\!\left(b_{{\rm MIR},{\rm log}}^{(m)}\right)\leq F_{\rm log}(\bar{b}_{\rm log})+o(1).$$

By strong convexity of $F_{\rm log}$,

$$b_{{\rm MIR},{\rm log}}^{(m)}\to\bar{b}_{\rm log}.$$
Finally,

$$\nabla F_{\rm log}(0)=-\frac{\beta}{2}\widehat{\mu}.$$

Thus, if $\widehat{\mu}\neq 0$, zero is not the minimizer and $\bar{b}_{\rm log}\neq 0$. At the minimizer $\bar{b}_{\rm log}$, the first-order condition for $F_{\rm log}$ gives

$$0=\beta\frac{1}{n}\sum_{i=1}^{n}g^{\prime}(Y_{i}\bar{b}_{\rm log}^{\top}S_{i})Y_{i}S_{i}+\rho_{b}\bar{b}_{\rm log}.$$

Equivalently,

$$\rho_{b}\bar{b}_{\rm log}=-\beta\frac{1}{n}\sum_{i=1}^{n}g^{\prime}(Y_{i}\bar{b}_{\rm log}^{\top}S_{i})Y_{i}S_{i}.$$

Taking Euclidean norms and using the triangle inequality,

$$\begin{aligned}
\rho_{b}\|\bar{b}_{\rm log}\|_{2}&\leq\beta\frac{1}{n}\sum_{i=1}^{n}\left|g^{\prime}(Y_{i}\bar{b}_{\rm log}^{\top}S_{i})\right||Y_{i}|\|S_{i}\|_{2}\\
&\leq\beta B,
\end{aligned}$$

because $|Y_{i}|=1$, $\|S_{i}\|_{2}\leq B$, and $|g^{\prime}(z)|\leq 1$. Therefore

$$\|\bar{b}_{\rm log}\|_{2}\leq\frac{\beta B}{\rho_{b}}.$$
∎
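Unlike $\bar{b}_{\rm sq}$, the logistic limit $\bar{b}_{\rm log}$ has no closed form, but it can be approximated by minimizing $F_{\rm log}$ directly. The NumPy sketch below uses plain gradient descent on synthetic data and then checks the gradient at zero and the norm bound $\beta B/\rho_{b}$. The data, the step-size rule, and all function names are assumptions made for illustration.

```python
import numpy as np


def grad_F_log(b, S, Y, beta, rho_b):
    """Gradient of F_log(b) = beta * mean(log(1 + exp(-Y_i b^T S_i))) + rho_b/2 ||b||^2."""
    margins = Y * (S @ b)
    g_prime = -1.0 / (1.0 + np.exp(margins))   # g'(z) = -1 / (1 + e^z)
    return beta * (S * (g_prime * Y)[:, None]).mean(axis=0) + rho_b * b


def minimize_F_log(S, Y, beta, rho_b, steps=5000):
    """Gradient descent with step 1/L, where L = beta * B^2 / 4 + rho_b bounds the Hessian."""
    B = np.linalg.norm(S, axis=1).max()        # empirical bound on ||S_i||_2
    step = 1.0 / (beta * B ** 2 / 4.0 + rho_b)
    b = np.zeros(S.shape[1])
    for _ in range(steps):
        b = b - step * grad_F_log(b, S, Y, beta, rho_b)
    return b, B


# Synthetic data: the label depends on the first coordinate of S plus noise.
rng = np.random.default_rng(1)
S = rng.standard_normal((200, 3))
Y = np.where(S[:, 0] + 0.5 * rng.standard_normal(200) > 0, 1.0, -1.0)
beta, rho_b = 1.0 / 6.0, 0.1
b_bar, B = minimize_F_log(S, Y, beta, rho_b)
```

Since $F_{\rm log}$ is $\rho_{b}$-strongly convex with an $L$-Lipschitz gradient, this iteration converges linearly, so a few thousand steps are ample for a problem of this size.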

### C.2 Proof of Theorem [C.6](https://arxiv.org/html/2606.06888#A3.Thmtheorem6)

###### Proof.

By Assumption [C.5](https://arxiv.org/html/2606.06888#A3.Thmtheorem5), validation prediction scores are $b^{\top}S$. Therefore validation risks depend only on the coefficient $b$ of the generalizable component.

For squared loss,

$$R_{\rm sq}(b)-R_{\rm sq}(0)=-2\mu^{\top}b+b^{\top}\Sigma b.$$

Thus $R_{\rm sq}(b)<R_{\rm sq}(0)$ whenever

$$2\mu^{\top}b-b^{\top}\Sigma b>0.$$

By Theorem [C.4](https://arxiv.org/html/2606.06888#A3.Thmtheorem4),

$$b_{{\rm clean},{\rm sq}}^{(m)}\to 0,\quad b_{{\rm MIR},{\rm sq}}^{(m)}\to\bar{b}_{\rm sq}.$$

If

$$2\mu^{\top}\bar{b}_{\rm sq}-\bar{b}_{\rm sq}^{\top}\Sigma\bar{b}_{\rm sq}>0,$$

then continuity gives

$$R_{\rm sq}\!\left(b_{{\rm MIR},{\rm sq}}^{(m)}\right)<R_{\rm sq}\!\left(b_{{\rm clean},{\rm sq}}^{(m)}\right)$$

for all sufficiently large $m$.

We next show that the squared-loss condition holds automatically when $\widehat{\mu}=\mu$ and $\widehat{\Sigma}=\Sigma$. In this case,

$$\bar{b}_{\rm sq}=\beta(\rho_{b}I_{d}+\eta\Sigma)^{-1}\mu.$$

Let

$$A:=(\rho_{b}I_{d}+\eta\Sigma)^{-1}.$$

Since $A$ and $\Sigma$ commute,

$$\bar{b}_{\rm sq}^{\top}\Sigma\bar{b}_{\rm sq}=\beta^{2}\mu^{\top}A\Sigma A\mu\leq\frac{\beta}{\eta}\,\beta\mu^{\top}A\mu=\frac{\beta}{\eta}\,\mu^{\top}\bar{b}_{\rm sq}.$$

Because

$$\eta=\frac{\beta(\delta+\alpha)}{\delta}>\beta,$$

we have $\beta/\eta<1$. Also,

$$\mu^{\top}\bar{b}_{\rm sq}=\beta\mu^{\top}A\mu>0,$$

since $\mu\neq 0$ and $A\succ 0$. Therefore

$$2\mu^{\top}\bar{b}_{\rm sq}-\bar{b}_{\rm sq}^{\top}\Sigma\bar{b}_{\rm sq}>\mu^{\top}\bar{b}_{\rm sq}>0.$$
For logistic loss,

$$\nabla R_{\rm log}(0)=-\frac{1}{2}\mu.$$

Moreover, with $\sigma(t)=(1+e^{-t})^{-1}$, the Hessian satisfies

$$\nabla^{2}R_{\rm log}(b)=\mathbb{E}\left[\sigma(Yb^{\top}S)\sigma(-Yb^{\top}S)SS^{\top}\right]\preceq\frac{1}{4}\mathbb{E}[SS^{\top}]\preceq\frac{B^{2}}{4}I_{d}.$$

Hence Taylor's expansion gives

$$R_{\rm log}(b)\leq R_{\rm log}(0)-\frac{1}{2}\mu^{\top}b+\frac{B^{2}}{8}\|b\|_{2}^{2}.$$

Thus $R_{\rm log}(b)<R_{\rm log}(0)$ whenever

$$\mu^{\top}b>\frac{B^{2}}{4}\|b\|_{2}^{2}.$$

Applying this condition at $b=\bar{b}_{\rm log}$, and using Theorem [C.4](https://arxiv.org/html/2606.06888#A3.Thmtheorem4), gives

$$R_{\rm log}\!\left(b_{{\rm MIR},{\rm log}}^{(m)}\right)<R_{\rm log}\!\left(b_{{\rm clean},{\rm log}}^{(m)}\right)$$

for all sufficiently large $m$.

It remains to justify the stated sufficient condition for logistic loss\. Let

$$L_{\rm emp}(b):=\frac{1}{n}\sum_{i=1}^{n}g(Y_{i}b^{\top}S_{i}),\qquad g(z):=\log(1+e^{-z}).$$

The minimizer $\bar{b}_{\rm log}$ satisfies

$$\rho_{b}\bar{b}_{\rm log}=-\beta\nabla L_{\rm emp}(\bar{b}_{\rm log}).$$

Since $\|S_{i}\|_{2}\leq B$, the gradient $\nabla L_{\rm emp}$ is Lipschitz on $\mathbb{R}^{d}$, and

$$\nabla L_{\rm emp}(0)=-\frac{1}{2}\widehat{\mu}.$$

Also, from the optimality equation and $\|\nabla L_{\rm emp}(b)\|_{2}\leq B$,

$$\|\bar{b}_{\rm log}\|_{2}\leq\frac{\beta B}{\rho_{b}}.$$

Therefore, as $\beta/\rho_{b}\to 0$,

$$\bar{b}_{\rm log}=\frac{\beta}{2\rho_{b}}\widehat{\mu}+o(\beta/\rho_{b}).$$

If $\mu^{\top}\widehat{\mu}>0$, then

$$\mu^{\top}\bar{b}_{\rm log}=\frac{\beta}{2\rho_{b}}\mu^{\top}\widehat{\mu}+o(\beta/\rho_{b}),$$

whereas

$$\|\bar{b}_{\rm log}\|_{2}^{2}=O\!\left((\beta/\rho_{b})^{2}\right).$$

Hence, for sufficiently small $\beta/\rho_{b}$,

$$\mu^{\top}\bar{b}_{\rm log}>\frac{B^{2}}{4}\|\bar{b}_{\rm log}\|_{2}^{2}.$$
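The small-$\beta/\rho_{b}$ expansion can be illustrated numerically. The sketch below solves the optimality equation by fixed-point iteration on synthetic data and compares the solution with the leading term $\frac{\beta}{2\rho_{b}}\widehat{\mu}$. The dataset, the helper names, and the explicit error bound $(\beta/\rho_{b})^{2}B^{3}/4$ are assumptions made here for illustration: the bound follows from the $B^{2}/4$ Lipschitz constant of $\nabla L_{\rm emp}$ together with $\|\bar{b}_{\rm log}\|_{2}\leq\beta B/\rho_{b}$, and is not stated in the paper.

```python
import math
import random

def make_data(n=400, seed=0):
    """Hypothetical bounded features, weakly correlated with a +/-1 label."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        s = (0.3 * y + rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
        data.append((s, y))
    return data

def grad_emp(b, data):
    """Gradient of L_emp(b) = mean log(1 + exp(-y <b, s>))."""
    g0 = g1 = 0.0
    for s, y in data:
        z = y * (b[0] * s[0] + b[1] * s[1])
        w = -y / (1.0 + math.exp(z))   # y * g'(z), where g'(z) = -1 / (1 + e^z)
        g0 += w * s[0]
        g1 += w * s[1]
    return (g0 / len(data), g1 / len(data))

def solve_fixed_point(eps, data, iters=100):
    """Iterate b <- -eps * grad L_emp(b), where eps stands for beta / rho_b.

    The map is a contraction whenever eps * B^2 / 4 < 1.
    """
    b = (0.0, 0.0)
    for _ in range(iters):
        g = grad_emp(b, data)
        b = (-eps * g[0], -eps * g[1])
    return b

def leading_order_error(eps, data):
    """Return (||b_bar - (eps/2) mu_hat||_2, eps^2 * B^3 / 4)."""
    n = len(data)
    B = max(math.hypot(s[0], s[1]) for s, _ in data)
    mu_hat = [sum(y * s[k] for s, y in data) / n for k in range(2)]
    b_bar = solve_fixed_point(eps, data)
    err = math.hypot(b_bar[0] - 0.5 * eps * mu_hat[0],
                     b_bar[1] - 0.5 * eps * mu_hat[1])
    return err, eps ** 2 * B ** 3 / 4.0

data = make_data()
for eps in (0.1, 0.01):
    err, bound = leading_order_error(eps, data)
    print(f"beta/rho_b = {eps}: error = {err:.3e}, bound = {bound:.3e}")
```

Shrinking `eps` by a factor of ten should shrink the printed error by roughly a factor of one hundred, consistent with an $o(\beta/\rho_{b})$ remainder.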
The asymptotic gain results follow from the same convergence and continuity:

$$\Delta_{{\rm sq},m}\to R_{\rm sq}(0)-R_{\rm sq}(\bar{b}_{\rm sq})>0,$$

and

$$\Delta_{{\rm log},m}\to R_{\rm log}(0)-R_{\rm log}(\bar{b}_{\rm log})>0.$$

∎

### C.3 Proof of Corollary [C.7](https://arxiv.org/html/2606.06888#A3.Thmtheorem7)

###### Proof\.

Let

$$a:=\frac{\mu}{\|\mu\|_{2}}.$$

Then

$$a^{\top}\widehat{\mu}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}a^{\top}S_{i}.$$

The summands satisfy

$$|Y_{i}a^{\top}S_{i}|\leq B,\qquad\mathbb{E}[Y_{i}a^{\top}S_{i}]=a^{\top}\mu=\|\mu\|_{2}.$$

Hoeffding's inequality [Vershynin, [2018](https://arxiv.org/html/2606.06888#bib.bib47)] gives

$$\mathbb{P}(a^{\top}\widehat{\mu}\leq 0)=\mathbb{P}\!\left(a^{\top}\widehat{\mu}-\|\mu\|_{2}\leq-\|\mu\|_{2}\right)\leq\exp\!\left(-\frac{n\|\mu\|_{2}^{2}}{2B^{2}}\right).$$

Since $a^{\top}\widehat{\mu}>0$ is equivalent to $\mu^{\top}\widehat{\mu}>0$, the result follows. ∎
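To see how conservative the Hoeffding bound is, one can compare it with an exact probability in a simple special case. The sketch below assumes, purely for illustration, that each summand $Y_{i}a^{\top}S_{i}$ takes the value $+B$ with probability $(1+\delta)/2$ and $-B$ otherwise, so that $\|\mu\|_{2}=B\delta$. This two-point distribution and the function names are not from the paper.

```python
import math

def prob_nonpositive_mean(n, delta):
    """Exact P(sample mean <= 0) for n i.i.d. two-point summands.

    Each summand is +B with probability (1 + delta) / 2 and -B otherwise.
    The mean is <= 0 exactly when the number k of +B draws satisfies 2k <= n;
    the value of B cancels out of this event.
    """
    p = (1.0 + delta) / 2.0
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(n // 2 + 1))

def hoeffding_bound(n, mu_norm, B):
    """exp(-n ||mu||_2^2 / (2 B^2)), the bound used in the proof."""
    return math.exp(-n * mu_norm ** 2 / (2.0 * B ** 2))

B = 1.0
delta = 0.2   # hypothetical signal strength, so ||mu||_2 = B * delta
for n in (10, 50, 200):
    exact = prob_nonpositive_mean(n, delta)
    bound = hoeffding_bound(n, B * delta, B)
    print(f"n = {n}: exact = {exact:.4e}, Hoeffding bound = {bound:.4e}")
```

Both quantities decay exponentially in $n$, with the exact probability sitting below the bound.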

### C.4 Proof of Theorem [C.8](https://arxiv.org/html/2606.06888#A3.Thmtheorem8)

###### Proof\.

Under $G_{m}=mI_{n}$,

$$T_{m}=\rho_{w}(G_{m}+\rho_{w}I_{n})^{-1}=\frac{\rho_{w}}{m+\rho_{w}}I_{n}.$$

Write

$$t_{m}:=\frac{\rho_{w}}{m+\rho_{w}}.$$

Using the profiled squared-loss formula from the proof of Theorem [C.4](https://arxiv.org/html/2606.06888#A3.Thmtheorem4), and using $\widehat{\mu}=\mu$ and $\widehat{\Sigma}=\Sigma$, we obtain

$$b_{{\rm clean},{\rm sq}}^{(m)}=t_{m}(\rho_{b}I_{d}+t_{m}\Sigma)^{-1}\mu.$$

For the key-ablating objective, profiling out $u$ gives

$$\frac{1}{2n}(Y-\mathbf{S}b)^{\top}T_{m}(Y-\mathbf{S}b)+\frac{\beta}{2n}\|Y-\mathbf{S}b\|_{2}^{2}+\frac{\rho_{b}}{2}\|b\|_{2}^{2}.$$

Differentiating with respect to $b$ yields

$$b_{{\rm key},{\rm sq}}^{(m)}=(t_{m}+\beta)\left(\rho_{b}I_{d}+(t_{m}+\beta)\Sigma\right)^{-1}\mu.$$
Let

$$\Sigma=U\,{\rm diag}(\lambda_{1},\ldots,\lambda_{d})\,U^{\top},\qquad U^{\top}\mu=(\mu_{1},\ldots,\mu_{d})^{\top}.$$

If $\lambda_{j}=0$, then $\mu_{j}=0$ since $\mu=\mathbb{E}[YS]$ and $\Sigma=\mathbb{E}[SS^{\top}]$. Indeed, $\lambda_{j}=0$ implies that the corresponding projection of $S$ is zero almost surely, and hence its correlation with $Y$ is also zero. Thus only terms with $\lambda_{j}>0$ contribute to the risk.

For $\alpha_{0}>0$, define

$$b(\alpha_{0}):=\alpha_{0}(\rho_{b}I_{d}+\alpha_{0}\Sigma)^{-1}\mu.$$

In the eigenbasis of $\Sigma$, the $j$-th coordinate is, for $\lambda_{j}>0$,

$$b_{j}(\alpha_{0})=\frac{\mu_{j}}{\lambda_{j}}\,\frac{\alpha_{0}\lambda_{j}/\rho_{b}}{1+\alpha_{0}\lambda_{j}/\rho_{b}}.$$

Let

$$s(x):=\frac{x}{1+x}.$$

The reduction in squared risk from using $b(\alpha_{0})$ instead of $0$ is

$$R_{\rm sq}(0)-R_{\rm sq}(b(\alpha_{0}))=2\mu^{\top}b(\alpha_{0})-b(\alpha_{0})^{\top}\Sigma b(\alpha_{0}).$$

Write $U^{\top}b(\alpha_{0})=(b_{1}(\alpha_{0}),\ldots,b_{d}(\alpha_{0}))^{\top}$. In the eigenbasis of $\Sigma$,

$$\mu^{\top}b(\alpha_{0})=\sum_{j=1}^{d}\mu_{j}b_{j}(\alpha_{0}),\qquad b(\alpha_{0})^{\top}\Sigma b(\alpha_{0})=\sum_{j=1}^{d}\lambda_{j}b_{j}(\alpha_{0})^{2}.$$

Therefore

$$R_{\rm sq}(0)-R_{\rm sq}(b(\alpha_{0}))=\sum_{j=1}^{d}\left\{2\mu_{j}b_{j}(\alpha_{0})-\lambda_{j}b_{j}(\alpha_{0})^{2}\right\}.$$

If $\lambda_{j}=0$, then $\mu_{j}=0$, so the corresponding term is zero. Thus only the terms with $\lambda_{j}>0$ remain. For such $j$,

$$b_{j}(\alpha_{0})=\frac{\mu_{j}}{\lambda_{j}}\,s\!\left(\frac{\alpha_{0}\lambda_{j}}{\rho_{b}}\right).$$

Substituting this expression, and writing $s_{j}:=s(\alpha_{0}\lambda_{j}/\rho_{b})$ for brevity, gives

$$\begin{aligned}
2\mu_{j}b_{j}(\alpha_{0})-\lambda_{j}b_{j}(\alpha_{0})^{2}
&=2\mu_{j}\frac{\mu_{j}}{\lambda_{j}}s_{j}-\lambda_{j}\left[\frac{\mu_{j}}{\lambda_{j}}s_{j}\right]^{2}\\
&=\frac{2\mu_{j}^{2}}{\lambda_{j}}s_{j}-\frac{\mu_{j}^{2}}{\lambda_{j}}s_{j}^{2}\\
&=\frac{\mu_{j}^{2}}{\lambda_{j}}\,s_{j}\left[2-s_{j}\right].
\end{aligned}$$

Hence

$$R_{\rm sq}(0)-R_{\rm sq}(b(\alpha_{0}))=\sum_{\lambda_{j}>0}\frac{\mu_{j}^{2}}{\lambda_{j}}\,s\!\left(\frac{\alpha_{0}\lambda_{j}}{\rho_{b}}\right)\left[2-s\!\left(\frac{\alpha_{0}\lambda_{j}}{\rho_{b}}\right)\right].$$

Since

$$b_{{\rm clean},{\rm sq}}^{(m)}=b(t_{m}),\qquad b_{{\rm key},{\rm sq}}^{(m)}=b(t_{m}+\beta),$$

the gain $\Delta_{{\rm key},m}$ is a sum over $\lambda_{j}>0$ of terms of the form

$$\frac{\mu_{j}^{2}}{\lambda_{j}}\left[s(x+\kappa_{j})\{2-s(x+\kappa_{j})\}-s(x)\{2-s(x)\}\right],$$

where

$$x=\frac{t_{m}\lambda_{j}}{\rho_{b}},\qquad\kappa_{j}=\frac{\beta\lambda_{j}}{\rho_{b}}>0.$$

Note that

$$s(x+\kappa)-s(x)=\frac{\kappa}{(1+x)(1+x+\kappa)}$$

and

$$2-s(x+\kappa)-s(x)=\frac{\kappa+2x+2}{(1+x)(1+x+\kappa)}.$$

Since the bracketed difference factors as $\{s(x+\kappa_{j})-s(x)\}\{2-s(x+\kappa_{j})-s(x)\}$, each nonzero spectral contribution equals

$$\frac{\mu_{j}^{2}}{\lambda_{j}}\,\frac{\kappa_{j}(\kappa_{j}+2x+2)}{(1+x)^{2}(1+x+\kappa_{j})^{2}}.$$

For fixed $\kappa>0$, define

$$F_{\kappa}(x):=\frac{\kappa(\kappa+2x+2)}{(1+x)^{2}(1+x+\kappa)^{2}}.$$

Then

$$F_{\kappa}^{\prime}(x)=-\frac{2\kappa\left(\kappa^{2}+3\kappa x+3\kappa+3x^{2}+6x+3\right)}{(1+x)^{3}(1+x+\kappa)^{3}}<0.$$

Since

$$t_{m}=\frac{\rho_{w}}{m+\rho_{w}}$$

is strictly decreasing in $m$, each nonzero spectral contribution to $\Delta_{{\rm key},m}$ is strictly increasing in $m$. Because $\mu\neq 0$, at least one such contribution is nonzero. Hence $\Delta_{{\rm key},m}$ is strictly increasing in $m$.

Finally, $t_{m}\to 0$, so $x\to 0$ for every $j$, and

$$\lim_{m\to\infty}\Delta_{{\rm key},m}=\sum_{\lambda_{j}>0}\frac{\mu_{j}^{2}}{\lambda_{j}}\,\frac{\kappa_{j}(\kappa_{j}+2)}{(1+\kappa_{j})^{2}}>0.$$

∎
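The closed-form spectral expression for $\Delta_{{\rm key},m}$ can be checked against a direct evaluation of the two risk reductions. The sketch below works in the eigenbasis of $\Sigma$ (so $\Sigma$ is diagonal) and uses made-up eigenvalues, projections $\mu_{j}$, and hyperparameters. All numerical values and function names are illustrative assumptions.

```python
import math

def risk_reduction(alpha0, lam, mu, rho_b):
    """R_sq(0) - R_sq(b(alpha0)) = sum_j [2 mu_j b_j - lam_j b_j^2], lam_j > 0."""
    total = 0.0
    for lam_j, mu_j in zip(lam, mu):
        if lam_j > 0:
            b_j = alpha0 * mu_j / (rho_b + alpha0 * lam_j)
            total += 2.0 * mu_j * b_j - lam_j * b_j ** 2
    return total

def key_gain_direct(m, beta, rho_w, rho_b, lam, mu):
    """Gain as the difference of risk reductions at t_m + beta and t_m."""
    t_m = rho_w / (m + rho_w)
    return (risk_reduction(t_m + beta, lam, mu, rho_b)
            - risk_reduction(t_m, lam, mu, rho_b))

def key_gain_closed_form(m, beta, rho_w, rho_b, lam, mu):
    """Sum over lam_j > 0 of (mu_j^2 / lam_j) * F_kappa_j(x_j)."""
    t_m = rho_w / (m + rho_w)
    total = 0.0
    for lam_j, mu_j in zip(lam, mu):
        if lam_j > 0:
            x = t_m * lam_j / rho_b
            kappa = beta * lam_j / rho_b
            total += (mu_j ** 2 / lam_j) * kappa * (kappa + 2 * x + 2) / (
                (1 + x) ** 2 * (1 + x + kappa) ** 2)
    return total

def key_gain_limit(beta, rho_b, lam, mu):
    """Limit as m -> infinity (x -> 0 in every spectral term)."""
    total = 0.0
    for lam_j, mu_j in zip(lam, mu):
        if lam_j > 0:
            kappa = beta * lam_j / rho_b
            total += (mu_j ** 2 / lam_j) * kappa * (kappa + 2) / (1 + kappa) ** 2
    return total

# Hypothetical spectrum (including one zero eigenvalue) and hyperparameters.
lam = [2.0, 0.5, 0.0]
mu = [1.0, -0.3, 0.0]
beta, rho_w, rho_b = 0.5, 4.0, 1.0

for m in (1, 4, 16, 64):
    print(m, key_gain_direct(m, beta, rho_w, rho_b, lam, mu),
          key_gain_closed_form(m, beta, rho_w, rho_b, lam, mu))
print("limit:", key_gain_limit(beta, rho_b, lam, mu))
```

The two columns should agree up to floating-point error, and the values should increase with `m` toward the printed limit.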

## Appendix D Derivation of Quanta Scaling Law

To make the paper self-contained and easier for readers to follow, this section summarizes the Quanta argument from Michaud [[2026](https://arxiv.org/html/2606.06888#bib.bib16)] that we use in our paper. The background is skill learning: next-token prediction is assumed to require a large collection of discrete predictive skills, called *quanta*. A model either learns a quantum or it does not, and scaling improves performance by allowing the model to learn more quanta in descending order of usefulness.

Loss as a Function of Learned Quanta. Index quanta by decreasing use frequency. Let $p_{k}$ be the probability that the $k$-th quantum is needed on a randomly drawn token. The Quanta model assumes a Zipf tail

$$p_{k}=\frac{1}{Z}k^{-(1+\alpha)},\qquad Z=\sum_{k=1}^{\infty}k^{-(1+\alpha)},\qquad\alpha>0.\tag{15}$$

In the simplest monogenic version of the model, each token depends mainly on one quantum. Suppose learning a quantum lowers the loss on those tokens from $b$ to $a$, with $b>a$. If the model has learned the first $n$ quanta, its expected loss is

$$L(n)=\sum_{k=1}^{n}ap_{k}+\sum_{k=n+1}^{\infty}bp_{k}=a+(b-a)\sum_{k=n+1}^{\infty}p_{k}\approx a+\frac{b-a}{\alpha Z}n^{-\alpha},\tag{16}$$

where the last step uses the standard tail approximation

$$\sum_{k=n+1}^{\infty}k^{-(1+\alpha)}\approx\int_{n}^{\infty}x^{-(1+\alpha)}\,dx=\frac{n^{-\alpha}}{\alpha}.$$

Therefore

$$L(n)\approx E+Cn^{-\alpha},\tag{17}$$

where $E=a$ is the irreducible loss floor and $C>0$ absorbs the remaining constants. This is the key step: a Zipf distribution over skill frequencies induces a power law in the loss as a function of the number of learned skills.
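The quality of the tail approximation behind Eq. (17) is easy to probe numerically. The sketch below compares the loss computed from a (numerically truncated) Zipf tail with the power-law form $a+\frac{b-a}{\alpha Z}n^{-\alpha}$. The values of `alpha`, `a`, and `b`, the truncation cutoff, and the function names are assumptions chosen for illustration.

```python
import math

def zipf_tail(n, alpha, cutoff=200_000):
    """Approximate sum_{k > n} k^-(1 + alpha).

    Terms up to `cutoff` are summed explicitly; the remainder is replaced by
    a midpoint-corrected integral, (cutoff + 0.5)^-alpha / alpha.
    """
    head = sum(k ** -(1.0 + alpha) for k in range(n + 1, cutoff + 1))
    return head + (cutoff + 0.5) ** -alpha / alpha

alpha = 0.5                    # hypothetical Zipf exponent
a, b = 1.0, 3.0                # hypothetical per-token losses, with b > a
Z = zipf_tail(0, alpha)        # normalizer of the Zipf distribution

def loss_from_tail(n):
    """L(n) = a + (b - a) * sum_{k > n} p_k, using the numerical tail."""
    return a + (b - a) * zipf_tail(n, alpha) / Z

def loss_power_law(n):
    """Power-law form a + (b - a) / (alpha * Z) * n^-alpha from Eq. (16)."""
    return a + (b - a) / (alpha * Z) * n ** -alpha

for n in (10, 100, 1000):
    print(n, loss_from_tail(n), loss_power_law(n))

# Log-log slope of the reducible loss between n = 100 and n = 1000.
slope = math.log((loss_from_tail(1000) - a) / (loss_from_tail(100) - a)) / math.log(10)
print("slope:", slope, " expected about:", -alpha)
```

The integral slightly overestimates the discrete tail, so the power-law column sits marginally above the tail-based column, with the gap shrinking as `n` grows.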

Parameter Scaling. If data is abundant, then the bottleneck is model capacity. Assume each quantum requires approximately $c_{N}$ parameters to represent. A model with $N$ parameters can then learn

$$n_{N}\approx\frac{N}{c_{N}}\tag{18}$$

quanta. Substituting this into Eq. ([17](https://arxiv.org/html/2606.06888#A4.E17)) gives

$$L(N,\infty)\approx E+A_{N}N^{-\alpha},\tag{19}$$

so the parameter-scaling exponent is

$$\alpha_{N}=\alpha.\tag{20}$$
Data Scaling. In the data-constrained multi-epoch regime, repeated passes over the same corpus do not create new rare skills. The relevant resource is the number of unique tokens $U$. Assume that learning the $k$-th quantum requires at least $\tau$ tokens in the unique dataset that use that quantum. Then the last quantum that can be learned, denoted $n_{U}$, satisfies

$$Up_{n_{U}}\approx\tau.\tag{21}$$

Using $p_{k}\propto k^{-(1+\alpha)}$, we obtain

$$n_{U}\approx\left(\frac{U}{Z\tau}\right)^{1/(1+\alpha)}.\tag{22}$$

Substituting again into Eq. ([17](https://arxiv.org/html/2606.06888#A4.E17)) yields

$$L(\infty,U)\approx E+A_{U}U^{-\alpha/(1+\alpha)},\tag{23}$$

so the data-scaling exponent is

$$\alpha_{U}=\frac{\alpha}{1+\alpha}.\tag{24}$$

This is why the data exponent is smaller than the parameter exponent in the basic Quanta picture.
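The chain from the coverage condition in Eq. (21) to the exponent in Eq. (24) can be traced with a few lines of code. In the sketch below, the constants `Z`, `tau`, `E`, and `C` are placeholder values rather than fitted quantities, and the function names are hypothetical.

```python
import math

def zipf_prob(k, alpha, Z):
    """p_k = k^-(1 + alpha) / Z, evaluated at a (possibly non-integer) index k."""
    return k ** -(1.0 + alpha) / Z

def last_learnable_quantum(U, alpha, Z, tau):
    """Solve U * p_n = tau for n, giving n_U = (U / (Z * tau))^(1 / (1 + alpha))."""
    return (U / (Z * tau)) ** (1.0 / (1.0 + alpha))

def data_limited_loss(U, alpha, Z, tau, E, C):
    """Plug n_U into L(n) = E + C * n^-alpha."""
    n_U = last_learnable_quantum(U, alpha, Z, tau)
    return E + C * n_U ** -alpha

# Placeholder constants, chosen only to make the example concrete.
alpha, Z, tau, E, C = 0.5, 2.612, 100.0, 1.7, 2.0

U1, U2 = 1e9, 1e11
L1 = data_limited_loss(U1, alpha, Z, tau, E, C)
L2 = data_limited_loss(U2, alpha, Z, tau, E, C)

# Log-log slope of the reducible loss with respect to unique tokens.
slope = math.log((L2 - E) / (L1 - E)) / math.log(U2 / U1)
print("measured slope:", slope)
print("-alpha / (1 + alpha):", -alpha / (1.0 + alpha))
```

With `alpha = 0.5` the data exponent is 1/3, smaller than the parameter exponent 1/2, matching the remark after Eq. (24).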

A Quanta-Motivated Joint Law. The single-axis derivations above do not uniquely determine a joint $(N,U)$ law, but they do imply that the number of learned quanta is jointly limited by parameter capacity and unique-data coverage. A hard bottleneck view would write

$$n(N,U)\lesssim\min\left\{\gamma_{N}N,\,\gamma_{U}U^{1/(1+\alpha)}\right\},\tag{25}$$

for some constants $\gamma_{N},\gamma_{U}>0$. Equivalently,

$$n(N,U)^{-1}\gtrsim\max\left\{\frac{1}{\gamma_{N}N},\,\frac{1}{\gamma_{U}U^{1/(1+\alpha)}}\right\}.\tag{26}$$

For fitting, it is convenient to replace this hard maximum by a smooth additive envelope in inverse-skill space,

$$n(N,U)^{-1}\approx\frac{A^{\prime}}{N}+\frac{B^{\prime}}{U^{1/(1+\alpha)}}.\tag{27}$$

Substituting this into $L-E\propto n^{-\alpha}$ gives the Quanta-motivated joint law

$$L_{\mathrm{Q}}(N,U)=E+\left(\frac{A}{N}+\frac{B}{U^{1/(1+\alpha)}}\right)^{\alpha},\tag{28}$$

which is exactly the form used in Eq. ([8](https://arxiv.org/html/2606.06888#A2.E8)). This coupling should be read as a smooth interpolation motivated by the Quanta asymptotes. It is attractive because it recovers both derived limits:

$$\begin{aligned}
L_{\mathrm{Q}}(N,\infty)&=E+A^{\alpha}N^{-\alpha},&&(29)\\
L_{\mathrm{Q}}(\infty,U)&=E+B^{\alpha}U^{-\alpha/(1+\alpha)}.&&(30)
\end{aligned}$$

Thus, the Quanta picture explains why the benefit of increasing model size should depend on the available unique data: both resources control the number of skills that can be learned, and the loss is governed by that shared latent quantity.
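A short numerical sketch can make the coupling in Eq. (28) concrete. The constants `E`, `A`, `B`, and `alpha` below are arbitrary placeholders, not the paper's fitted values, and the function names are hypothetical. The comparison of gains at the end relies on `alpha < 1`, where $t\mapsto t^{\alpha}$ is concave.

```python
import math

def quanta_joint_loss(N, U, E, A, B, alpha):
    """L_Q(N, U) = E + (A / N + B / U^(1 / (1 + alpha)))^alpha, as in Eq. (28)."""
    return E + (A / N + B / U ** (1.0 / (1.0 + alpha))) ** alpha

def gain_from_doubling_params(N, U, E, A, B, alpha):
    """Loss reduction obtained by going from N to 2N parameters at fixed U."""
    return (quanta_joint_loss(N, U, E, A, B, alpha)
            - quanta_joint_loss(2 * N, U, E, A, B, alpha))

# Placeholder constants for illustration only.
E, A, B, alpha = 1.7, 5e7, 3e3, 0.5
N = 1e9

# The two single-resource limits, Eqs. (29) and (30).
print(quanta_joint_loss(N, math.inf, E, A, B, alpha), E + A ** alpha * N ** -alpha)
print(quanta_joint_loss(math.inf, 1e9, E, A, B, alpha),
      E + B ** alpha * 1e9 ** (-alpha / (1.0 + alpha)))

# The benefit of a larger model depends on how much unique data is available.
for U in (1e6, 1e9, 1e12):
    print(f"U = {U:.0e}: gain from doubling N = "
          f"{gain_from_doubling_params(N, U, E, A, B, alpha):.4f}")
```

For these placeholder values, the gain from doubling `N` grows as `U` grows, which is the interaction that a purely additive law cannot express.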