Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

arXiv cs.LG Papers

Summary

The paper identifies repetition mismatch as a primary cause for data mixture experiments failing to scale, and proposes a repetition-controlled subsampling procedure that allows small-scale experiments to recover near-optimal mixtures using far fewer tokens.

arXiv:2606.07597v1 Announce Type: new

Cached at: 06/09/26, 08:47 AM

# Why Data Mixture Experiments Don’t Scale and How to Fix Them
Source: [https://arxiv.org/html/2606.07597](https://arxiv.org/html/2606.07597)
Kevin Zhou†, Lisa Alazraki†, Kris Cao‡, Marek Rei†
†Imperial College London, ‡Cohere
kevinzhou497@gmail.com, {lisa.alazraki20, marek.rei}@imperial.ac.uk, kriscao@cohere.com

###### Abstract

Pre\-training data mixtures are commonly tuned by running small\-scale experiments and extrapolating to the target training budget\. When high\-quality data is scarce and must be repeated, this extrapolation frequently fails, but the source of the failure has not been isolated\. We show that a primary culprit is a repetition mismatch: because high\-quality datasets are small, their repetition rate changes as the training budget grows, shifting the optimal mixture in ways that small\-scale proxy experiments do not anticipate\. A subsampling procedure that matches the target repetition rate controls for this effect\. In a two\-source setting combining limited high\-quality data with web crawl, a single repetition\-controlled experiment using only 1/16 of the target tokens recovers a mixture within 0\.05 of the optimum for a 757M parameter model, compared to an error of 0\.75 without repetition control\. Achieving comparable accuracy without repetition control requires three to four horizons, consuming 44 to 94% of the target token budget\. With three data sources, the larger mixture space requires more than a single experiment to constrain, but the approach remains effective: at the 757M scale, just two repetition\-controlled horizons recover the optimal mixture, outperforming baselines that instead require the full two\-source experiments to construct\. Our results reveal that repetition dynamics, not scale alone, shape whether small\-scale mixture experiments generalize\. More broadly, they suggest that data repetition deserves treatment as a first\-class variable in mixture optimization, rather than an inconvenient side effect of limited data\.


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.07597v1/x1.png)

Figure 1: Optimal WikiText repetitions across training horizons for 4 model sizes. All models require similar repetition counts at small budgets, but diverge sharply as the budget grows, causing mixtures optimized at small scale by standard extrapolation to be systematically wrong at the target scale.

The composition of training data from multiple sources is a critical factor in language model (LM) pre-training, with substantial impact on downstream performance Miranda et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib24)). Pre-training corpora typically combine noisy web crawl with cleaner, higher-quality sources such as books or curated websites, and the balance between them is a key challenge: high-quality data provides more substantive learning signals, while web crawl data helps with generalizability and regularization Elazar et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib96)); Longpre et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib95)). As noted by Shukor et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib47)), selecting data mixtures through trial and error is costly and time-consuming, and several methods have been proposed for more efficient mixture selection Xie et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib89)); Fan et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib73)); Ye et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib80)). A common strategy is to run smaller-scale experiments and extrapolate the results to the target training budget, yet practitioners frequently find that mixtures tuned at small scale fail to transfer to larger regimes Kang et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib100)).

In this work, we identify a key factor behind this failure: repetition mismatch. When high-quality data is scarce, it must be repeated many times during training. Crucially, the number of repetitions changes as the training budget grows, and existing work has shown that repeated passes over a dataset can significantly affect model performance Muennighoff et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib31)). Standard scaling-based mixture selection ignores this effect: a small-scale proxy experiment imposes a fundamentally different repetition regime on the high-quality data than the target run, distorting the loss landscape and shifting the apparent optimal mixture, an effect that grows with model scale (Figure [1](https://arxiv.org/html/2606.07597#S1.F1)). We show that controlling for this repetition mismatch largely resolves the extrapolation problem.

Our approach builds on a repetition-aware subsampling procedure first used by Li et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib46)) during pre-training. The procedure downsamples all data sources so that the high-quality data undergoes the same number of repetitions as in the full training run, while using only a fraction of the total tokens. We use this procedure to isolate repetition as a variable in mixture prediction. To test this, we compare it against a standard scaling laws-based approach that extrapolates optimal mixture ratios from shorter training runs without matching repetition rates.

Our experiments combine a limited high-quality dataset, either WikiText Merity et al. ([2017](https://arxiv.org/html/2606.07597#bib.bib23)) or biomedical literature from PubMed [National Center for Biotechnology Information (NCBI)](https://arxiv.org/html/2606.07597#bib.bib52), with FineWeb Penedo et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib57)), a large-scale web crawl corpus. We first examine the two-source case, then extend to a three-source setting using both high-quality datasets alongside FineWeb. Experiments span four model sizes (30M to 757M parameters), allowing us to trace how model capacity interacts with repetition dynamics in mixture prediction. Our findings are:

- Repetition mismatch is a dominant confounder in small-scale mixture prediction. Matching the repetition rate of the target run, rather than just reducing the training budget, is sufficient to recover accurate mixture predictions from small-scale experiments. The effect is consistent across WikiText and PubMed as high-quality sources, and strengthens monotonically with model size from 124M to 757M parameters.
- Repetition control enables accurate mixture prediction from minimal compute. In the two-source setting, a single repetition-controlled experiment using ∼1/16 of the target horizon tokens recovers mixtures within 0.05–0.10 of the optimum for the 757M model across both WikiText and PubMed, compared to errors of 0.65–0.75 without repetition control. Reaching comparable accuracy without repetition control requires three to four horizons, consuming 44 to 94% of the target token budget.
- With more data sources, the mixture space requires more experiments to constrain, but repetition control remains effective. At 757M parameters, just two repetition-controlled horizons recover the target optimum at a fraction of the target token budget. At 124M, multiple repetition-controlled horizons outperform both baselines, with the four-horizon prediction effectively matching the optimum (loss 2.91950 vs. 2.91820).
- Repetition rate should be an explicit knob in mixture optimization, not an incidental consequence of budget and dataset size. Our results demonstrate that controlling for repetition dynamics, rather than treating them as a side effect of limited data, is critical for reliable mixture prediction in data-constrained regimes.

## 2 Background

### 2.1 Data Mixing in Pre-training

Pre-training corpora for language models combine multiple data sources, and the proportions assigned to each source have a substantial impact on model performance (Du et al., [2022](https://arxiv.org/html/2606.07597#bib.bib82); Miranda et al., [2025](https://arxiv.org/html/2606.07597#bib.bib24)). Selecting effective mixtures through trial and error is costly (Shukor et al., [2025](https://arxiv.org/html/2606.07597#bib.bib47)), motivating a range of methods that aim to predict good mixtures from smaller-scale experiments. These include scaling law-based approaches that fit parametric functions to predict loss under different mixture configurations (Ge et al., [2025](https://arxiv.org/html/2606.07597#bib.bib102); Shukor et al., [2025](https://arxiv.org/html/2606.07597#bib.bib47); Ye et al., [2025](https://arxiv.org/html/2606.07597#bib.bib80)), proxy model methods that learn domain weights from auxiliary training signals (Xie et al., [2023](https://arxiv.org/html/2606.07597#bib.bib89); Fan et al., [2024](https://arxiv.org/html/2606.07597#bib.bib73)), and regression-based approaches that treat mixture selection as a prediction task (Liu et al., [2025a](https://arxiv.org/html/2606.07597#bib.bib67), [b](https://arxiv.org/html/2606.07597#bib.bib101)). Here we focus on domain-level mixing, with the goal of determining the proportion of each data source in the training mixture, rather than on strategies operating on individual samples.

### 2.2 Data Repetition and Its Effects

When high-quality data is limited, repeated passes over the same documents are often unavoidable at training time. However, this repetition has well-documented non-linear effects on learning. Muennighoff et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib31)) show that for a fixed compute budget, up to approximately four repetitions of a dataset are as effective as training on new data, whereas more than four trigger diminishing returns and performance eventually plateaus. Xue et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib99)) extend this analysis, finding that the severity of multi-epoch degradation depends on model size, dataset size, and training objective, and that larger models are more susceptible to overfitting from excessive repetition on small datasets. Standard scaling laws for pre-training (Hoffmann et al., [2022](https://arxiv.org/html/2606.07597#bib.bib76); Kaplan et al., [2020](https://arxiv.org/html/2606.07597#bib.bib77)) typically assume abundant data and do not account for these repetition effects, raising questions about their applicability in data-constrained regimes.

Crucially, these findings imply that the number of times a dataset is repeated during training is not merely a side effect of a limited data budget, but a variable that actively shapes the loss landscape\. When a high\-quality dataset is small relative to the training budget, the repetition count changes substantially as the budget grows\. This means that a small\-scale proxy experiment and the target training run operate under fundamentally different repetition regimes, even when they use the same mixture proportions\.

### 2.3 The Repetition Mismatch Problem

Although the effects of data repetition are well-established, existing methods for predicting optimal data mixtures from small-scale experiments do not explicitly control for these repetition dynamics. Scaling law-based approaches (Ge et al., [2025](https://arxiv.org/html/2606.07597#bib.bib102); Shukor et al., [2025](https://arxiv.org/html/2606.07597#bib.bib47); Ye et al., [2025](https://arxiv.org/html/2606.07597#bib.bib80)) extrapolate performance trends across training budgets without accounting for the fact that the repetition count of constrained data sources changes between the proxy and target scales. Proxy model methods (Xie et al., [2023](https://arxiv.org/html/2606.07597#bib.bib89); Fan et al., [2024](https://arxiv.org/html/2606.07597#bib.bib73)) similarly learn domain weights without modeling repetition. As a result, these methods implicitly assume that the relationship between mixture proportions and performance will hold at the target scale, an assumption that breaks down when repetition dynamics differ.

Li et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib46)) introduce a repetition-aware subsampling procedure that incidentally addresses this issue: by downsampling all data sources so that the high-quality data undergoes the same number of repetitions as in the full training run, the procedure preserves repetition dynamics while using only a fraction of the total tokens. Li et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib46)) use this procedure to inform data mixture decisions during pre-training, but do not isolate repetition mismatch as a distinct phenomenon or characterize when it matters. In this work, we identify repetition mismatch as a previously unrecognized confounder in data mixing research, and show that controlling for it addresses the extrapolation failure of small-scale mixture predictions across model sizes, dataset choices, and number of data sources.

## 3 Experimental Setup

To test whether repetition mismatch explains the failure of small-scale mixture extrapolation, we conduct experiments across multiple high-quality datasets, model sizes, and numbers of training domains.¹

¹ Our code is available at [https://github.com/kevinzhou497/data-mixing-language-models](https://github.com/kevinzhou497/data-mixing-language-models)

### 3.1 Datasets

We use datasets that differ in size and quality and are commonly employed in language model pre-training Yang et al. ([2025](https://arxiv.org/html/2606.07597#bib.bib97)); Bolton et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib98)), allowing us to study how repetition dynamics affect mixture prediction when combining smaller, high-quality datasets with larger, noisier sources. Additional details of each dataset are provided in Appendix [A](https://arxiv.org/html/2606.07597#A1).

#### WikiText

WikiText Merity et al. ([2017](https://arxiv.org/html/2606.07597#bib.bib23)) contains articles from Wikipedia’s Good and Featured list, providing a high-quality data source that contrasts with more general web crawl. In our experiments, we use the wikitext-103-raw-v1 instance.² After tokenization, the training set contains 116,881,107 tokens. Model performance is evaluated on a held-out WikiText validation set of 131,072 tokens.

² [https://huggingface.co/datasets/Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext)

In all experiments, we exclude web crawl data from the validation set and evaluate performance exclusively on high-quality domains. This allows us to more directly assess the effects of data repetition and mixture composition, as validation on noisy web-sourced text can obscure differences induced by mixing strategies. Focusing on curated domains provides a more stable and interpretable evaluation signal when high-quality data is the primary object of optimization, consistent with prior studies that evaluate pre-training mixtures using curated-domain validation data Muennighoff et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib31)).

#### PubMed

PubMed is a collection of biomedical literature ([National Center for Biotechnology Information (NCBI)](https://arxiv.org/html/2606.07597#bib.bib52)), with a corresponding dataset³ that contains citation records for its articles. Many of these records include the text of the abstract, which we use as our data samples. To roughly match the size of the WikiText training set, we sample abstracts until the total number of tokens reaches approximately 120 million, resulting in a training set of 120,000,060 tokens. Evaluation is performed on a held-out PubMed validation set of 131,072 tokens, consistent with our procedure for WikiText.

³ [https://huggingface.co/datasets/ncbi/pubmed](https://huggingface.co/datasets/ncbi/pubmed)

#### FineWeb

FineWeb Penedo et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib57)) is a large-scale web-crawled text corpus. Penedo et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib57)) have shown that this corpus leads to stronger language model performance compared to other web crawl datasets such as RefinedWeb Penedo et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib65)) and C4 Raffel et al. ([2020](https://arxiv.org/html/2606.07597#bib.bib49)).

We use the FineWeb\-10BT dataset, a subsample of approximately 10 billion tokens\. At this scale and with the training horizons we employ, no repetitions of the FineWeb dataset occur, creating a clear contrast with the smaller high\-quality domain datasets\. This reflects realistic scenarios where limited domain\-specific data is supplemented by abundant web\-crawled text, and where repetition is a factor for only the high\-quality sources\.

### 3.2 Models

We employ a modified version of NanoGPT (Karpathy, [2022](https://arxiv.org/html/2606.07597#bib.bib90)) from the modded-nanogpt repository (Jordan et al., [2024a](https://arxiv.org/html/2606.07597#bib.bib19)). This architecture reproduces GPT-2, with enhancements including the Muon optimizer (Jordan et al., [2024b](https://arxiv.org/html/2606.07597#bib.bib22)) and Rotary Positional Embeddings (RoPE) Su et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib21)). Further details are provided in Appendix [B](https://arxiv.org/html/2606.07597#A2).

Four model sizes are considered, obtained by varying the number of layers and the embedding dimension, resulting in approximately 30 million, 124 million, 345 million, and 757 million parameter models\. This range allows us to trace how model capacity interacts with repetition dynamics in mixture prediction, which is central to our analysis\. Training results for different horizons are obtained separately for each model size, isolating the effect of model capacity on the severity of repetition mismatch\.

### 3.3 Data Mixing Objective

Let $D = \{\mathcal{D}_1, \ldots, \mathcal{D}_n\}$ be a set of datasets and $T_\star$ be the target training horizon. The goal is to find an optimal target mixture vector $\boldsymbol{m}^*(T_\star) = [m^*_1, \ldots, m^*_n]$ of length $n$, with $\sum m^*_i = 1$ and $0 \leq m^*_i \leq 1$.

Since model size is fixed, the task is to predict the optimal mixture at the full target horizon using the same model trained on smaller token budgets. Specifically, after obtaining optimal mixture vectors $\{\boldsymbol{\tilde{m}}(T_j)\}_{j=1}^{h}$ for smaller horizons $0 < T_1 < \cdots < T_h < T_\star$, we aim to predict the target mixture

$$\boldsymbol{m}^*(T_\star) \in \arg\min_{\boldsymbol{m} \in \Delta^{n-1}} \mathcal{L}(\boldsymbol{m}; T_\star),$$

where $\Delta^{n-1} = \{\boldsymbol{m} \in \mathbb{R}_{\geq 0}^{n} : \sum_{i=1}^{n} m_i = 1\}$ is the probability simplex and $\mathcal{L}(\boldsymbol{m}; T_\star)$ is the average cross-entropy loss on held-out validation data for mixture $\boldsymbol{m}$ and horizon $T_\star$.

Our objective is to predict $\boldsymbol{m}^*(T_\star)$ as accurately as possible while minimizing the number of required experiments to reduce computational costs.
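As a rough illustration of the objective above, the following sketch approximates the arg min over the simplex by enumerating mixtures on a regular grid and keeping the one with the lowest loss. It is not taken from the paper's code: the function names `simplex_grid`, `best_mixture`, and `toy_loss` are hypothetical, the 0.05 grid step for three sources is an assumption, and `toy_loss` is a made-up quadratic standing in for "train a model on this mixture and measure validation loss".

```python
from itertools import product


def simplex_grid(n_sources, steps=20):
    """Yield every mixture whose entries are multiples of 1/steps and sum to 1."""
    for counts in product(range(steps + 1), repeat=n_sources - 1):
        remainder = steps - sum(counts)
        if remainder >= 0:
            yield tuple(c / steps for c in counts) + (remainder / steps,)


def best_mixture(loss_fn, n_sources, steps=20):
    """Grid approximation of the arg min over the simplex for a given loss function."""
    return min(simplex_grid(n_sources, steps), key=loss_fn)


def toy_loss(mixture):
    # Hypothetical stand-in: a quadratic bowl, NOT a trained model's validation loss.
    centre = (0.45, 0.25, 0.3)
    return sum((m - c) ** 2 for m, c in zip(mixture, centre))


# With steps=20 the grid spacing is 0.05.
predicted = best_mixture(toy_loss, n_sources=3)
```

In practice each call to the loss function is a full training run, which is why the number of mixtures evaluated per horizon dominates the cost.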

## 4 Two-Source Data Mixtures

![Refer to caption](https://arxiv.org/html/2606.07597v1/x2.png)

Figure 2: Cross-entropy loss on the validation set at the end of the training run plotted for the 757M model when using WikiText as the high-quality domain data. The scaling laws-based experiment results are shown with the dotted lines, and the repeat-aware results are shown with the solid lines.

We first investigate the two-source case: high-quality data from either WikiText or PubMed combined with FineWeb. To isolate the role of repetition mismatch, we compare mixture predictions obtained with and without repetition control, holding all other experimental variables constant.

#### Without Repetition Control \(Scaling Laws\)\.

We identify the optimal data mixture at each of 5 training horizons, with each subsequent horizon approximately doubling in length. For each horizon, we measure the optimal number of repetitions of the high-quality dataset, equivalent to the optimal mixing ratio in the two-source case. We then predict the target-horizon mixture in four ways: (i) using only the smallest horizon as a direct extrapolation; (ii–iv) fitting a linear regression over the two, three, or four smallest horizons between training tokens and optimal repetitions. Unlike prior data mixing scaling laws, which predict loss or perplexity, we predict the optimal mixing ratio directly, as our goal is to evaluate how accurately small-scale experiments recover the target mixture. Each model size is treated as a separate set of experiments. For each horizon, we sweep over mixing ratios in increments of 0.05, training until a U-shaped curve in the validation loss emerges, indicating that the optimal ratio has been bracketed. Across experiments, the validation loss consistently decreases monotonically until the optimal ratio and then increases, confirming this as a reliable stopping criterion.
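The multi-horizon formulations (ii–iv) can be sketched as below, under stated assumptions. The regression is fitted on raw token counts (the text does not say whether tokens are log-transformed), the predicted repetition count is converted to a high-quality proportion via repetitions = tokens × proportion / dataset size, and all numbers in the usage lines are invented for illustration. The helper names `fit_line` and `predict_hq_proportion` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_hq_proportion(horizons, optimal_reps, target_tokens, hq_tokens):
    """Extrapolate optimal repetitions to the target horizon, then convert
    the repetition count into a high-quality mixing proportion in [0, 1]."""
    if len(horizons) < 2:
        raise ValueError("a line fit needs at least two horizons")
    slope, intercept = fit_line(horizons, optimal_reps)
    predicted_reps = slope * target_tokens + intercept
    proportion = predicted_reps * hq_tokens / target_tokens
    return min(max(proportion, 0.0), 1.0)


# Illustrative inputs only: three small horizons and their (invented) optimal repetitions.
example = predict_hq_proportion(
    horizons=[250e6, 500e6, 1e9],
    optimal_reps=[1.0, 2.0, 4.0],
    target_tokens=4e9,
    hq_tokens=100e6,
)
```

The FineWeb share in the two-source case is then one minus the returned proportion, which is the quantity whose error Table 1 reports.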

#### With Repetition Control \(Repeat\-aware\)\.

To control for repetition mismatch, we apply the subsampling procedure described in Section [2.3](https://arxiv.org/html/2606.07597#S2.SS3), using the same target horizon and subsamples of $\frac{1}{16}$, $\frac{1}{8}$, $\frac{1}{4}$, and $\frac{1}{2}$, with subsampling performed at the document level.

To illustrate, let the target horizon contain $T_\star$ tokens, the high-quality dataset be $D$ with length $n_D$, and the mixture proportion be $h$. In the full training setup, the number of repetitions of $D$ is $\frac{T_\star \times h}{n_D}$. In the $\frac{1}{S}$ subsample scenario, only the first $\frac{1}{S}$ fraction of documents from $D$ is used, and the training horizon is reduced to $\frac{1}{S} \times T_\star$ tokens. Keeping the same mixture proportion $h$ and assuming reasonably uniform document lengths, the number of repetitions becomes

$$\frac{T_\star \times \frac{1}{S} \times h}{n_D \times \frac{1}{S}} = \frac{T_\star \times h}{n_D}.$$

Thus, this setup preserves the same number of repetitions as the full scenario while using only $\frac{1}{S}$ of the total tokens. We compute the optimal mixture at each subsampled horizon and predict the target mixture using the same four formulations as above. Further experimental details, including hyperparameters, are provided in Appendix [C](https://arxiv.org/html/2606.07597#A3).
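The arithmetic above can be checked numerically. The sketch below contrasts a proxy run that only shortens the horizon with one that also subsamples the high-quality data. It plugs in figures quoted elsewhere in the paper (116,881,107 WikiText training tokens, a ∼3.74B-token WikiText target horizon, and a high-quality proportion of 0.15, the 757M two-source WikiText optimum), and it treats the subsample as exactly 1/S of the tokens, whereas the paper subsamples at the document level so real counts are only approximate. The function name `repetitions` is hypothetical.

```python
def repetitions(horizon_tokens, hq_proportion, hq_dataset_tokens):
    """Number of passes over the high-quality dataset: T * h / n_D."""
    return horizon_tokens * hq_proportion / hq_dataset_tokens


TARGET_TOKENS = 3.74e9       # approximate WikiText target horizon
WIKITEXT_TOKENS = 116_881_107
HQ_PROPORTION = 0.15         # 757M two-source WikiText optimum
S = 16                       # 1/16 subsample

# Full target run.
full = repetitions(TARGET_TOKENS, HQ_PROPORTION, WIKITEXT_TOKENS)

# Proxy without repetition control: shorter horizon, full high-quality dataset.
naive = repetitions(TARGET_TOKENS / S, HQ_PROPORTION, WIKITEXT_TOKENS)

# Repeat-aware proxy: shorter horizon AND a 1/S subsample of the high-quality data
# (idealised here as exactly 1/S of the tokens).
repeat_aware = repetitions(TARGET_TOKENS / S, HQ_PROPORTION, WIKITEXT_TOKENS / S)
```

Under these inputs the full run and the repeat-aware proxy both see roughly 4.8 passes over the high-quality data, while the uncontrolled proxy sees about 0.3, so the two proxies probe very different repetition regimes at the same mixture proportion.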

### 4.1 Two-Source Results and Discussion

Figure [2](https://arxiv.org/html/2606.07597#S4.F2) presents results for the 757M model with WikiText as the high-quality source. Training token counts across corresponding horizons differ slightly between setups, as repeat-aware subsampling operates at the document level.

Across all model sizes and both high-quality sources, a consistent pattern emerges: without repetition control, the optimal proportion of high-quality data decreases as the training budget increases, reflecting the diminishing returns from excessive repetition documented by Muennighoff et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib31)). PubMed and WikiText follow remarkably similar trajectories, underscoring the generalizability of these findings. Full mixture results are in Appendix [D](https://arxiv.org/html/2606.07597#A4).

| High-Quality Dataset | Model Size | Scaling Laws 1-H | Scaling Laws 2-H | Scaling Laws 3-H | Scaling Laws 4-H | Repeat-aware 1-H | Repeat-aware 2-H | Repeat-aware 3-H | Repeat-aware 4-H |
|---|---|---|---|---|---|---|---|---|---|
| WikiText | 30M | 0.250 | 0.064 | 0.060 | 0.039 | 0.550 | 0.256 | 0.013 | 0.044 |
| WikiText | 124M | 0.650 | 0.034 | 0.006 | 0.001 | 0.200 | 0.200 | 0.097 | 0.062 |
| WikiText | 345M | 0.750 | 0.129 | 0.013 | ≤0.05 | 0.100 | 0.100 | 0.011 | 0.017 |
| WikiText | 757M | 0.750 | 0.028 | 0.010 | 0.006 | 0.050 | 0.050 | 0.050 | ≤0.05 |
| PubMed | 30M | 0.300 | 0.114 | 0.110 | 0.059 | 0.500 | 0.315 | 0.119 | 0.079 |
| PubMed | 124M | 0.650 | 0.259 | 0.044 | 0.032 | 0.200 | 0.200 | 0.095 | 0.061 |
| PubMed | 345M | 0.750 | 0.129 | 0.032 | 0.012 | 0.100 | 0.100 | 0.011 | 0.016 |
| PubMed | 757M | 0.650 | 0.011 | 0.001 | 0.029 | 0.100 | 0.100 | 0.100 | 0.050 |

Table 1: Distances from the optimal mixture across model sizes, measured by the absolute difference in FineWeb proportion across datasets and prediction horizons. All columns after Model Size report prediction error. Since these experiments involve only two data sources, this difference directly reflects the mixture prediction error. The better-performing method is in bold; differences smaller than the 0.05 mixing-ratio sweep granularity should be interpreted as ties. Cells reported as ≤0.05 indicate that the prediction landed within one increment of the 0.05 mixing-ratio sweep from the optimum, the smallest difference resolvable by our search. The 1-H, 2-H, etc. column headers refer to the 1-Horizon, 2-Horizon, etc. predictions.

#### Repetition Control Stabilizes Mixture Predictions.

The most striking feature of Figure [2](https://arxiv.org/html/2606.07597#S4.F2) is the tight clustering of optimal mixing ratios across horizons under repetition control, compared to the large drift without it. Table [1](https://arxiv.org/html/2606.07597#S4.T1) quantifies this pattern.

#### Single\-Horizon Predictions\.

The single-horizon case most directly reveals the effect of repetition mismatch. Using only the smallest horizon ($\sim\frac{1}{16}$ of the target tokens), repetition control recovers a mixture within 0.050 of the optimum for the 757M model on WikiText, versus 0.750 without it. PubMed shows the same pattern (0.100 versus 0.650), as does the 345M model on both domains (0.100 versus 0.750). With WikiText, this “one-shot” prediction uses only ∼232M tokens compared to 3.74B at the target (∼241M and 3.84B with PubMed), yet recovers a near-optimal mixture at a fraction of the cost.

#### Multiple Horizons\.

| Model Size | Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|---|
| 124M | 0.3, 0.35, 0.35 | 0.001 | 2.94270 | Baseline 1 |
| 124M | **0.45, 0.25, 0.3** | 0.001 | 2.91820 | Optimal Mixture |
| 124M | 0.51, 0.245, 0.245 | 0.001 | 2.91950 | Four-Horizon Prediction |
| 124M | 0.56, 0.22, 0.22 | 0.001 | 2.92830 | Three-Horizon Prediction |
| 124M | 0.57, 0.215, 0.215 | 0.001 | 2.92965 | Two-Horizon Prediction |
| 124M | 0.65, 0.175, 0.175 | 0.001 | 2.95570 | Baseline 2 |
| 124M | 0.75, 0.125, 0.125 | 0.00141 | 3.01300 | Single-Horizon Prediction |
| 757M | **0.65, 0.175, 0.175** | 0.001 | 2.7699 | Optimal / Two-Horizon Prediction |
| 757M | 0.65, 0.15, 0.20 | 0.001 | 2.7751 | Baseline 1 |
| 757M | 0.825, 0.075, 0.100 | 0.001 | 2.8337 | Baseline 2 |
| 757M | 0.85, 0.075, 0.075 | 0.001 | 2.8518 | Single-Horizon Prediction |

Table 2: Key results from the three-source repeat-aware experiments at the full training horizon for the 124M and 757M models. Mixing ratios are shown as proportions of FineWeb, WikiText, and PubMed, with the optimal mixture per model size highlighted in bold.

The advantage narrows with multiple horizons: repetition control wins 5 of 12 multi-horizon comparisons for the 345M and 757M models, and both approaches converge to accurate predictions (within 0.05 for the 757M WikiText experiments; within 0.006 at four horizons). Many of these multi-horizon differences fall at or below the 0.05 sweep granularity, so the two methods are effectively tied in this regime. Since each additional horizon roughly doubles the token cost, the practical value of repetition control lies primarily in the single-horizon regime.

#### The Role of Model Capacity\.

The benefit of repetition control depends strongly on model capacity (Table [1](https://arxiv.org/html/2606.07597#S4.T1)). The single-horizon improvement shrinks from 0.70 at 757M parameters to 0.45 at 124M, and at 30M, repetition control is outperformed by scaling-law extrapolation, identifying a lower bound on the model scale where the method applies. Figure [1](https://arxiv.org/html/2606.07597#S1.F1) reveals the underlying mechanism: the optimal repetition count at the target horizon ranges from ∼5 for the 757M model to ∼24 for the 30M. Because subsampling preserves the repetition count while reducing the absolute number of high-quality tokens, smaller models that require many repetitions at the target horizon end up with too few unique tokens to learn from. Repetition control therefore works best when the model is large enough to extract signal from high-quality tokens efficiently. The 30M result establishes this lower bound empirically; the 124M, 345M, and 757M results show the method works above it.

Notably, the optimal repetition counts for the 30M and 124M models substantially exceed the ∼4-repetition threshold identified by Muennighoff et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib31)), beyond which diminishing returns typically begin. Rather than contradicting this finding, this suggests that the threshold is model-size-dependent. The PubMed experiments exhibit the same pattern (Figure [4](https://arxiv.org/html/2606.07597#A3.F4), Appendix [D](https://arxiv.org/html/2606.07597#A4)).

## 5 Three-Source Data Mixtures

We next test whether repetition control remains effective with a larger mixture space. We use three data sources: WikiText and PubMed as high-quality datasets, and FineWeb as the web crawl. Model evaluation averages the loss on the WikiText and PubMed validation sets. This setup reflects common pre-training configurations that combine multiple high-quality sources with general web crawl, such as the data mixture used for the first LLaMA models Touvron et al. ([2023](https://arxiv.org/html/2606.07597#bib.bib69)).

#### Experimental Procedure\.

We follow the same procedure as in the two\-source case, using the same subsample proportions and a full target horizon of ∼3\.79 billion tokens\. We run experiments at both the 124M and 757M model scales, mirroring the two\-source setup and allowing us to test whether the model\-capacity trend from Section [4\.1](https://arxiv.org/html/2606.07597#S4.SS1) carries over to the larger mixture space\. Since WikiText and PubMed differ slightly in total token count, we average the iteration counts from the corresponding two\-source experiments at each horizon\.

#### Baselines\.

We compare against two baselines derived from the two\-source results at each model scale: \(i\) using the optimal proportion of each high\-quality domain from its respective two\-source experiment, and \(ii\) setting the total high\-quality share to the average of the WikiText and PubMed two\-source optima and splitting that share between the two sources in proportion to those optima\. In both cases, the remainder is assigned to FineWeb\. Mixtures are written as \[FineWeb, WikiText, PubMed\]\. For the 124M model, both two\-source optima are 0\.35, yielding Baseline 1 = \[0\.30, 0\.35, 0\.35\] and Baseline 2 = \[0\.65, 0\.175, 0\.175\]\. For 757M, the WikiText and PubMed optima are 0\.15 and 0\.20, yielding Baseline 1 = \[0\.65, 0\.15, 0\.20\] and Baseline 2 = \[0\.825, 0\.075, 0\.100\]\. Both baselines use target\-horizon two\-source optima, giving them strictly more information about the two\-source structure of the problem than any small\-scale extrapolation would have; comparisons against them are therefore conservative\.
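
The two baseline constructions can be sketched in a few lines. The function names are illustrative, and the mixture order \[FineWeb, WikiText, PubMed\] is inferred from the 757M values quoted above; the sketch reproduces the four baseline mixtures stated in the text.

```python
def baseline_1(opt_wiki, opt_pubmed):
    """Keep each high-quality source at its own two-source optimum; FineWeb gets the rest."""
    return [1.0 - opt_wiki - opt_pubmed, opt_wiki, opt_pubmed]


def baseline_2(opt_wiki, opt_pubmed):
    """Total high-quality share = mean of the two optima, split in proportion to the optima."""
    total_hq = (opt_wiki + opt_pubmed) / 2
    wiki = total_hq * opt_wiki / (opt_wiki + opt_pubmed)
    pubmed = total_hq * opt_pubmed / (opt_wiki + opt_pubmed)
    return [1.0 - total_hq, wiki, pubmed]


# Two-source optima quoted in the text, as (WikiText, PubMed) per model scale.
two_source_optima = {"124M": (0.35, 0.35), "757M": (0.15, 0.20)}

for scale, (wiki, pubmed) in two_source_optima.items():
    b1 = [round(x, 3) for x in baseline_1(wiki, pubmed)]
    b2 = [round(x, 3) for x in baseline_2(wiki, pubmed)]
    print(scale, "Baseline 1:", b1, "Baseline 2:", b2)
```

Baseline 2 always assigns half as much high-quality data as Baseline 1, which is why its FineWeb proportion is consistently higher.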

### 5\.1 Three\-Source Results and Discussion

Table [2](https://arxiv.org/html/2606.07597#S4.T2) presents the three\-source results at the target horizon for the 124M and 757M models\.

#### One horizon is the floor; more horizons rapidly close the gap\.

With three sources, a single repetition\-controlled horizon yields mixtures that approach but do not reach the target optimum: \[0\.75, 0\.125, 0\.125\] at 124M against a true optimum of \[0\.45, 0\.25, 0\.3\], and \[0\.85, 0\.075, 0\.075\] at 757M against \[0\.65, 0\.175, 0\.175\]\. Both predictions drift toward higher FineWeb proportions, leaving a loss gap of ∼0\.08–0\.10 from the optimum, consistent with the larger mixture space requiring more than one experiment to constrain\. As we show next, adding even a single additional horizon dramatically narrows this gap\.

#### Two horizons suffice at larger model scale\.

At 757M, two horizons close the remaining gap to the optimum\. The two\-horizon repetition\-controlled prediction yields a FineWeb proportion of 0\.65, giving a mixture of \[0\.65, 0\.175, 0\.175\] that recovers the target optimum at sweep granularity \(avg\. loss 2\.7699\)\. It outperforms Baseline 2 \(2\.8337\) and matches Baseline 1 \(2\.7751\)\. Two short repetition\-controlled runs in the three\-source setting recover the optimal mixture, while the closest competing baseline requires two complete two\-source sweeps as a prerequisite\. This mirrors the model\-capacity trend observed in the two\-source experiments: as model size grows, repetition control becomes increasingly sample\-efficient, and the number of horizons needed to constrain the mixture space drops\.

![Refer to caption](https://arxiv.org/html/2606.07597v1/x3.png)

Figure 3: Cumulative training tokens across horizons for the WikiText two\-source experiments, as a percentage of the target token budget\. Each additional horizon roughly doubles the cumulative cost: by four horizons, a sweep consumes nearly the full target budget, making the single\-horizon regime the most cost\-efficient when accuracy permits\.

#### Multi\-horizon predictions converge to the optimum at smaller scale\.

At the 124M scale, where two horizons are not yet sufficient, additional horizons progressively close the gap to the optimum\. Fitting linear regressions over the smallest two, three, and four horizons predicts FineWeb proportions of approximately 0\.57, 0\.56, and 0\.51, respectively\. Distributing the remaining budget evenly between WikiText and PubMed and training at these predicted mixtures, all three outperform both baselines \(Table [10](https://arxiv.org/html/2606.07597#A5.T10)\)\. The four\-horizon prediction \(\[0\.51, 0\.245, 0\.245\]\) achieves a loss of 2\.91950, effectively matching the true optimum of 2\.91820, while the two\- and three\-horizon predictions reach competitive accuracy at substantially lower cost\. Together with the 757M two\-horizon result, these findings show that repetition control remains effective in the three\-source setting, with the number of horizons needed shrinking as model capacity grows, mirroring the same interaction between repetition dynamics and model scale that underlies the two\-source results\.
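
The multi-horizon prediction step can be illustrated with a short sketch: fit a line to the per-horizon optimal FineWeb proportions, evaluate it at the target horizon, and split the remainder evenly between the high-quality sources. The regressor is assumed here to be the log2 of the horizon's fraction of the target budget, which the text does not specify, and the per-horizon optima in the example are hypothetical placeholders rather than measured values.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_mixture(horizon_fractions, fineweb_optima, n_high_quality=2):
    """Extrapolate the FineWeb proportion to the target horizon (fraction 1.0)."""
    xs = [math.log2(f) for f in horizon_fractions]  # assumed regressor
    slope, intercept = fit_line(xs, fineweb_optima)
    fineweb = slope * math.log2(1.0) + intercept    # log2(1.0) == 0 at the target
    fineweb = min(max(fineweb, 0.0), 1.0)           # keep a valid proportion
    share = (1.0 - fineweb) / n_high_quality        # even split of the remainder
    return [fineweb] + [share] * n_high_quality


# Hypothetical optima at horizons of 1/16 and 1/8 of the target budget.
mixture = predict_mixture([1 / 16, 1 / 8], [0.75, 0.70])
print([round(p, 3) for p in mixture])
```

With two horizons the fit is exact and the prediction is a straight extrapolation of the trend between them; each additional horizon adds a point that constrains the slope further.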

## 6 Compute Cost of Mixture Prediction

Both approaches use the same training horizons and sweep over mixing ratios and learning rates identically, so the cost at any given horizon is the same; the practical difference is how many horizons are needed\. In the two\-source experiments, repetition control achieves a prediction error of just 0\.05 for the 757M model using a single horizon, roughly 6% of the target token budget\. Without repetition control, the same horizon yields an error of 0\.75, and reaching comparable accuracy requires three to four horizons, consuming 44% to 94% of the target budget\. With three sources, the larger mixture space requires more than one horizon, but two repetition\-controlled horizons \(roughly 19% of the target budget\) recover the optimum at 757M and beat both baselines at 124M\. The savings thus come entirely from needing fewer horizons, and grow with the target training budget; for precise per\-horizon token counts and a trillion\-token extrapolation, see Appendix [E](https://arxiv.org/html/2606.07597#A5)\.
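
The quoted budget fractions follow from simple arithmetic if the horizons sit at 1/16, 1/8, 1/4, and 1/2 of the target budget. These fractions are an assumption inferred from the percentages above and the doubling pattern in Figure 3; the exact per-horizon token counts are given in Appendix E. The sketch below accumulates the cost of using the smallest k horizons.

```python
from fractions import Fraction

# Assumed horizon sizes as fractions of the target token budget.
HORIZONS = [Fraction(1, 16), Fraction(1, 8), Fraction(1, 4), Fraction(1, 2)]


def cumulative_cost(num_horizons):
    """Fraction of the target budget consumed by the smallest `num_horizons` horizons."""
    return sum(HORIZONS[:num_horizons], Fraction(0))


for k in range(1, len(HORIZONS) + 1):
    cost = cumulative_cost(k)
    print(f"{k} horizon(s): {cost} of target = {float(cost) * 100:.2f}%")
```

Under this assumption the cumulative costs are 1/16, 3/16, 7/16, and 15/16 of the target budget, that is 6.25%, 18.75%, 43.75%, and 93.75%, which agree with the rounded figures of 6%, 19%, 44%, and 94%.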

## 7 Conclusion

In this work, we set out to understand why data mixtures tuned at small scale often fail to transfer to larger training budgets in data\-constrained settings\. Our experiments point to a clear culprit: repetition mismatch\. When high\-quality data is scarce, it must be repeated during training, and the number of repetitions changes as the training budget grows\. Small\-scale proxy experiments and full\-scale target runs therefore operate under different repetition regimes, and standard scaling\-based approaches do not account for this\.

Controlling for repetition resolves the problem\. A single repetition\-controlled horizon using ∼1/16 of the target tokens recovers a mixture within 0\.05 of the optimum at 757M; this advantage strengthens monotonically with model capacity; and in a three\-source setting, as few as two horizons match the target optimum at larger scale\. Each of these results required holding repetition fixed between proxy and target experiments rather than letting it drift with the training budget\.

Repetition control is also a simple intervention, operating at the dataset level and requiring no parametric modeling, proxy training runs, or hyperparameter tuning beyond the mixing\-ratio sweep that any mixture prediction method already performs\. It is therefore orthogonal to existing approaches and could be incorporated into any of them, yet to our knowledge no current method does so\.

More broadly, these findings suggest that data repetition deserves to be treated as a primary variable in mixture optimization rather than an inconvenient side effect of limited data\. Methods that predict mixtures from small\-scale proxy experiments in data\-constrained regimes should control for repetition dynamics, as failing to do so risks systematic prediction errors that grow with both model capacity and the gap between proxy and target scales\. As practitioners increasingly rely on smaller ablation runs to inform mixture decisions for billion\-parameter models, accounting for repetition mismatch will only become more important\.

## Limitations

Our experiments use models up to 757M parameters and training horizons up to ∼3\.8B tokens, both smaller than modern LLM pre\-training\. Studying trends at smaller scales to extrapolate to larger ones is standard practice in scaling laws research \(Kaplan et al\., [2020](https://arxiv.org/html/2606.07597#bib.bib77); Hoffmann et al\., [2022](https://arxiv.org/html/2606.07597#bib.bib76)\), and the consistent trend of repetition control becoming more effective with model size gives us reason to expect these findings will carry over to billion\-parameter models\. Empirical confirmation at that scale is beyond the resources of this work, and we leave it as a direction for future investigation\.

Our results are based on single runs per configuration, consistent with prevailing practice in scaling laws and data mixing research \(Bordt and Pawelczyk, [2026](https://arxiv.org/html/2606.07597#bib.bib105); Magnusson et al\., [2025](https://arxiv.org/html/2606.07597#bib.bib106)\), where the cost of pre\-training experiments makes multi\-seed sweeps impractical at the scale of mixture\-ratio and horizon grids we explore\. Our claims accordingly rest on aggregate trends, such as repetition control’s scaling with model capacity, that hold consistently across both high\-quality datasets and all four model sizes\. Where individual cell differences fall at or below the 0\.05 mixing\-ratio sweep granularity, we treat the methods as effectively tied rather than relying on the precise values\.

We compare repetition control against a scaling\-laws\-based baseline that fits a linear regression to optimal mixing ratios across horizons\. This formulation is a stylized model of the practitioner workflow of running mixture sweeps at smaller scales and projecting trends forward, used in industry pre\-training pipelines such as Llama 3 \(Grattafiori et al\., [2024](https://arxiv.org/html/2606.07597#bib.bib107)\) and identified as standard practice by Shukor et al\. \([2025](https://arxiv.org/html/2606.07597#bib.bib47)\) and Kang et al\. \([2025](https://arxiv.org/html/2606.07597#bib.bib100)\)\. No published mixture\-prediction method, to our knowledge, includes per\-source repetition as an input variable, including the parametric approaches of Ye et al\. \([2025](https://arxiv.org/html/2606.07597#bib.bib80)\), Shukor et al\. \([2025](https://arxiv.org/html/2606.07597#bib.bib47)\), Ge et al\. \([2025](https://arxiv.org/html/2606.07597#bib.bib102)\), Kang et al\. \([2025](https://arxiv.org/html/2606.07597#bib.bib100)\), and Liu et al\. \([2025b](https://arxiv.org/html/2606.07597#bib.bib101)\)\. Combining repetition\-aware subsampling with these methods is a natural extension of our findings\.

Our evaluation focuses on validation loss over the high\-quality domains \(WikiText and PubMed\)\. This choice is consistent with prior data mixing work \(Ye et al\., [2025](https://arxiv.org/html/2606.07597#bib.bib80); Muennighoff et al\., [2023](https://arxiv.org/html/2606.07597#bib.bib31)\) and reflects a methodological consideration in our setup: because FineWeb is unrepeated across all horizons, web\-crawl perplexity primarily tracks exposure to FineWeb tokens rather than mixture quality, while the signal that distinguishes mixtures is most cleanly observable on the repeated high\-quality sources\. Confirming that our findings transfer to downstream benchmarks remains an important direction for future work\.

Our experiments combine one or two high\-quality datasets with a single large web crawl, a simplified setup compared to real\-world pre\-training corpora, which typically draw from seven or more sources with varying repetition rates \(Touvron et al\., [2023](https://arxiv.org/html/2606.07597#bib.bib69); Weber et al\., [2024](https://arxiv.org/html/2606.07597#bib.bib68)\)\. This few\-source design follows prior work studying repetition effects and data composition in controlled settings \(Muennighoff et al\., [2023](https://arxiv.org/html/2606.07597#bib.bib31); Xue et al\., [2023](https://arxiv.org/html/2606.07597#bib.bib99)\), and the repetition\-aware procedure generalizes naturally to any number of sources, though we leave its empirical behaviour on more diverse mixtures to future investigation\.

All experiments use English\-language datasets\. Since data scarcity is often more acute for non\-English languages \(Joshi et al\., [2020](https://arxiv.org/html/2606.07597#bib.bib104)\), repetition mismatch may be an even greater concern in multilingual settings, which we encourage future work to investigate\.

## Ethical Considerations

All datasets employed in our experiments are publicly available and have been used in accordance with their respective licenses\. Our models are based on open\-source architectures and are trained using open\-source software released under permissive licenses\. Additionally, we will share all our code publicly to support reproducibility\.

Pre\-training language models is computationally expensive, and the experimental sweeps required to study data mixing compound this cost\. Our results show that controlling for repetition can reduce the experimental budget needed to identify effective mixtures\. This may help reduce the computational and environmental cost of mixture selection as data mixing studies become standard practice for billion\-parameter pre\-training\.

Our PubMed experiments use abstracts from the publicly released HuggingFace PubMed dataset\. We did not extract or process any personally identifiable information, and all biomedical content used is already publicly distributed for research purposes\.

## References

- Old optimizer, new norm: an anthology\.InOPT 2024: Optimization for Machine Learning,External Links:[Link](https://openreview.net/forum?id=ux18f5nOpD)Cited by:[§B\.1](https://arxiv.org/html/2606.07597#A2.SS1.p2.1)\.
- E\. Bolton, A\. Venigalla, M\. Yasunaga, D\. Hall, B\. Xiong, T\. Lee, R\. Daneshjou, J\. Frankle, P\. Liang, M\. Carbin, and C\. D\. Manning \(2024\)BioMedLM: a 2\.7b parameter language model trained on biomedical text\.External Links:2403\.18421,[Link](https://arxiv.org/abs/2403.18421)Cited by:[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.p1.1)\.
- S\. Bordt and M\. Pawelczyk \(2026\)Train once, answer all: many pretraining experiments for the cost of one\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=EoBmdFujak)Cited by:[Limitations](https://arxiv.org/html/2606.07597#Sx1.p2.1)\.
- N\. Du, Y\. Huang, A\. M\. Dai, S\. Tong, D\. Lepikhin, Y\. Xu, M\. Krikun, Y\. Zhou, A\. W\. Yu, O\. Firat, B\. Zoph, L\. Fedus, M\. P\. Bosma, Z\. Zhou, T\. Wang, E\. Wang, K\. Webster, M\. Pellat, K\. Robinson, K\. Meier\-Hellstern, T\. Duke, L\. Dixon, K\. Zhang, Q\. Le, Y\. Wu, Z\. Chen, and C\. Cui \(2022\)GLaM: efficient scaling of language models with mixture\-of\-experts\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 5547–5569\.External Links:[Link](https://proceedings.mlr.press/v162/du22c.html)Cited by:[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1)\.
- Y\. Elazar, A\. Bhagia, I\. Magnusson, A\. Ravichander, D\. Schwenk, A\. Suhr, P\. Walsh, D\. Groeneveld, L\. Soldaini, S\. Singh, H\. Hajishirzi, N\. Smith, and J\. Dodge \(2024\)What's in my big data?\.InInternational Conference on Representation Learning,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 7735–7790\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/1f7336fd66b6e6e63d1801fdd5930a5a-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1)\.
- S\. Fan, M\. Pagliardini, and M\. Jaggi \(2024\)DOGE: domain reweighting with generalization estimation\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 12895–12915\.External Links:[Link](https://proceedings.mlr.press/v235/fan24e.html)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.07597#S2.SS3.p1.1)\.
- C\. Ge, Z\. Ma, D\. Chen, Y\. Li, and B\. Ding \(2025\)BiMix: a bivariate data mixing law for language model pretraining\.External Links:2405\.14908,[Link](https://arxiv.org/abs/2405.14908)Cited by:[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.07597#S2.SS3.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p3.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, A\. Goyal, A\. Hartshorn, A\. Yang, A\. Mitra, A\. Sravankumar, A\. Korenev, A\. Hinsvark, A\. Rao, A\. Zhang, A\. Rodriguez, A\. Gregerson, A\. Spataru, B\. Roziere, B\. Biron, B\. Tang, B\. Chern, C\. Caucheteux, C\. Nayak, C\. Bi, C\. Marra, C\. McConnell, C\. Keller, C\. Touret, C\. Wu, C\. Wong, C\. C\. Ferrer, C\. Nikolaidis, D\. Allonsius, D\. Song, D\. Pintz, D\. Livshits, D\. Wyatt, D\. Esiobu, D\. Choudhary, D\. Mahajan, D\. Garcia\-Olano, D\. Perino, D\. Hupkes, E\. Lakomkin, E\. AlBadawy, E\. Lobanova, E\. Dinan, E\. M\. Smith, F\. Radenovic, F\. Guzmán, F\. Zhang, G\. Synnaeve, G\. Lee, G\. L\. Anderson, G\. Thattai, G\. Nail, G\. Mialon, G\. Pang, G\. Cucurell, H\. Nguyen, H\. Korevaar, H\. Xu, H\. Touvron, I\. Zarov, I\. A\. Ibarra, I\. Kloumann, I\. Misra, I\. Evtimov, J\. Zhang, J\. Copet, J\. Lee, J\. Geffert, J\. Vranes, J\. Park, J\. Mahadeokar, J\. Shah, J\. van der Linde, J\. Billock, J\. Hong, J\. Lee, J\. Fu, J\. Chi, J\. Huang, J\. Liu, J\. Wang, J\. Yu, J\. Bitton, J\. Spisak, J\. Park, J\. Rocca, J\. Johnstun, J\. Saxe, J\. Jia, K\. V\. Alwala, K\. Prasad, K\. Upasani, K\. Plawiak, K\. Li, K\. Heafield, K\. Stone, K\. El\-Arini, K\. Iyer, K\. Malik, K\. Chiu, K\. Bhalla, K\. Lakhotia, L\. Rantala\-Yeary, L\. van der Maaten, L\. Chen, L\. Tan, L\. Jenkins, L\. Martin, L\. Madaan, L\. Malo, L\. Blecher, L\. Landzaat, L\. de Oliveira, M\. Muzzi, M\. Pasupuleti, M\. Singh, M\. Paluri, M\. Kardas, M\. Tsimpoukelli, M\. Oldham, M\. Rita, M\. Pavlova, M\. Kambadur, M\. Lewis, M\. Si, M\. K\. Singh, M\. Hassan, N\. Goyal, N\. Torabi, N\. Bashlykov, N\. Bogoychev, N\. Chatterji, N\. Zhang, O\. Duchenne, O\. Çelebi, P\. Alrassy, P\. Zhang, P\. Li, P\. Vasic, P\. Weng, P\. Bhargava, P\. Dubal, P\. Krishnan, P\. S\. Koura, P\. Xu, Q\. He, Q\. Dong, R\. Srinivasan, R\. Ganapathy, R\. Calderer, R\. S\. 
Cabral, R\. Stojnic, R\. Raileanu, R\. Maheswari, R\. Girdhar, R\. Patel, R\. Sauvestre, R\. Polidoro, R\. Sumbaly, R\. Taylor, R\. Silva, R\. Hou, R\. Wang, S\. Hosseini, S\. Chennabasappa, S\. Singh, S\. Bell, S\. S\. Kim, S\. Edunov, S\. Nie, S\. Narang, S\. Raparthy, S\. Shen, S\. Wan, S\. Bhosale, S\. Zhang, S\. Vandenhende, S\. Batra, S\. Whitman, S\. Sootla, S\. Collot, S\. Gururangan, S\. Borodinsky, T\. Herman, T\. Fowler, T\. Sheasha, T\. Georgiou, T\. Scialom, T\. Speckbacher, T\. Mihaylov, T\. Xiao, U\. Karn, V\. Goswami, V\. Gupta, V\. Ramanathan, V\. Kerkez, V\. Gonguet, V\. Do, V\. Vogeti, V\. Albiero, V\. Petrovic, W\. Chu, W\. Xiong, W\. Fu, W\. Meers, X\. Martinet, X\. Wang, X\. Wang, X\. E\. Tan, X\. Xia, X\. Xie, X\. Jia, X\. Wang, Y\. Goldschlag, Y\. Gaur, Y\. Babaei, Y\. Wen, Y\. Song, Y\. Zhang, Y\. Li, Y\. Mao, Z\. D\. Coudert, Z\. Yan, Z\. Chen, Z\. Papakipos, A\. Singh, A\. Srivastava, A\. Jain, A\. Kelsey, A\. Shajnfeld, A\. Gangidi, A\. Victoria, A\. Goldstand, A\. Menon, A\. Sharma, A\. Boesenberg, A\. Baevski, A\. Feinstein, A\. Kallet, A\. Sangani, A\. Teo, A\. Yunus, A\. Lupu, A\. Alvarado, A\. Caples, A\. Gu, A\. Ho, A\. Poulton, A\. Ryan, A\. Ramchandani, A\. Dong, A\. Franco, A\. Goyal, A\. Saraf, A\. Chowdhury, A\. Gabriel, A\. Bharambe, A\. Eisenman, A\. Yazdan, B\. James, B\. Maurer, B\. Leonhardi, B\. Huang, B\. Loyd, B\. D\. Paola, B\. Paranjape, B\. Liu, B\. Wu, B\. Ni, B\. Hancock, B\. Wasti, B\. Spence, B\. Stojkovic, B\. Gamido, B\. Montalvo, C\. Parker, C\. Burton, C\. Mejia, C\. Liu, C\. Wang, C\. Kim, C\. Zhou, C\. Hu, C\. Chu, C\. Cai, C\. Tindal, C\. Feichtenhofer, C\. Gao, D\. Civin, D\. Beaty, D\. Kreymer, D\. Li, D\. Adkins, D\. Xu, D\. Testuggine, D\. David, D\. Parikh, D\. Liskovich, D\. Foss, D\. Wang, D\. Le, D\. Holland, E\. Dowling, E\. Jamil, E\. Montgomery, E\. Presani, E\. Hahn, E\. Wood, E\. Le, E\. Brinkman, E\. Arcaute, E\. Dunbar, E\. Smothers, F\. Sun, F\. Kreuk, F\. Tian, F\. Kokkinos, F\. 
Ozgenel, F\. Caggioni, F\. Kanayet, F\. Seide, G\. M\. Florez, G\. Schwarz, G\. Badeer, G\. Swee, G\. Halpern, G\. Herman, G\. Sizov, Guangyi, Zhang, G\. Lakshminarayanan, H\. Inan, H\. Shojanazeri, H\. Zou, H\. Wang, H\. Zha, H\. Habeeb, H\. Rudolph, H\. Suk, H\. Aspegren, H\. Goldman, H\. Zhan, I\. Damlaj, I\. Molybog, I\. Tufanov, I\. Leontiadis, I\. Veliche, I\. Gat, J\. Weissman, J\. Geboski, J\. Kohli, J\. Lam, J\. Asher, J\. Gaya, J\. Marcus, J\. Tang, J\. Chan, J\. Zhen, J\. Reizenstein, J\. Teboul, J\. Zhong, J\. Jin, J\. Yang, J\. Cummings, J\. Carvill, J\. Shepard, J\. McPhie, J\. Torres, J\. Ginsburg, J\. Wang, K\. Wu, K\. H\. U, K\. Saxena, K\. Khandelwal, K\. Zand, K\. Matosich, K\. Veeraraghavan, K\. Michelena, K\. Li, K\. Jagadeesh, K\. Huang, K\. Chawla, K\. Huang, L\. Chen, L\. Garg, L\. A, L\. Silva, L\. Bell, L\. Zhang, L\. Guo, L\. Yu, L\. Moshkovich, L\. Wehrstedt, M\. Khabsa, M\. Avalani, M\. Bhatt, M\. Mankus, M\. Hasson, M\. Lennie, M\. Reso, M\. Groshev, M\. Naumov, M\. Lathi, M\. Keneally, M\. Liu, M\. L\. Seltzer, M\. Valko, M\. Restrepo, M\. Patel, M\. Vyatskov, M\. Samvelyan, M\. Clark, M\. Macey, M\. Wang, M\. J\. Hermoso, M\. Metanat, M\. Rastegari, M\. Bansal, N\. Santhanam, N\. Parks, N\. White, N\. Bawa, N\. Singhal, N\. Egebo, N\. Usunier, N\. Mehta, N\. P\. Laptev, N\. Dong, N\. Cheng, O\. Chernoguz, O\. Hart, O\. Salpekar, O\. Kalinli, P\. Kent, P\. Parekh, P\. Saab, P\. Balaji, P\. Rittner, P\. Bontrager, P\. Roux, P\. Dollar, P\. Zvyagina, P\. Ratanchandani, P\. Yuvraj, Q\. Liang, R\. Alao, R\. Rodriguez, R\. Ayub, R\. Murthy, R\. Nayani, R\. Mitra, R\. Parthasarathy, R\. Li, R\. Hogan, R\. Battey, R\. Wang, R\. Howes, R\. Rinott, S\. Mehta, S\. Siby, S\. J\. Bondu, S\. Datta, S\. Chugh, S\. Hunt, S\. Dhillon, S\. Sidorov, S\. Pan, S\. Mahajan, S\. Verma, S\. Yamamoto, S\. Ramaswamy, S\. Lindsay, S\. Lindsay, S\. Feng, S\. Lin, S\. C\. Zha, S\. Patil, S\. Shankar, S\. Zhang, S\. Zhang, S\. Wang, S\. Agarwal, S\. 
Sajuyigbe, S\. Chintala, S\. Max, S\. Chen, S\. Kehoe, S\. Satterfield, S\. Govindaprasad, S\. Gupta, S\. Deng, S\. Cho, S\. Virk, S\. Subramanian, S\. Choudhury, S\. Goldman, T\. Remez, T\. Glaser, T\. Best, T\. Koehler, T\. Robinson, T\. Li, T\. Zhang, T\. Matthews, T\. Chou, T\. Shaked, V\. Vontimitta, V\. Ajayi, V\. Montanez, V\. Mohan, V\. S\. Kumar, V\. Mangla, V\. Ionescu, V\. Poenaru, V\. T\. Mihailescu, V\. Ivanov, W\. Li, W\. Wang, W\. Jiang, W\. Bouaziz, W\. Constable, X\. Tang, X\. Wu, X\. Wang, X\. Wu, X\. Gao, Y\. Kleinman, Y\. Chen, Y\. Hu, Y\. Jia, Y\. Qi, Y\. Li, Y\. Zhang, Y\. Zhang, Y\. Adi, Y\. Nam, Yu, Wang, Y\. Zhao, Y\. Hao, Y\. Qian, Y\. Li, Y\. He, Z\. Rait, Z\. DeVito, Z\. Rosnbrick, Z\. Wen, Z\. Yang, Z\. Zhao, and Z\. Ma \(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[Limitations](https://arxiv.org/html/2606.07597#Sx1.p3.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. de Las Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, T\. Hennigan, E\. Noland, K\. Millican, G\. van den Driessche, B\. Damoc, A\. Guy, S\. Osindero, K\. Simonyan, E\. Elsen, O\. Vinyals, J\. W\. Rae, and L\. Sifre \(2022\)Training compute\-optimal large language models\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088,[Link](https://dl.acm.org/doi/10.5555/3600270.3602446)Cited by:[§2\.2](https://arxiv.org/html/2606.07597#S2.SS2.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p1.1)\.
- K\. Jordan, J\. Bernstein, B\. Rappazzo, @fernbear\.bsky\.social, B\. Vlado, Y\. Jiacheng, F\. Cesista, B\. Koszarsky, and @Grad62304977 \(2024a\)Modded\-nanogpt: speedrunning the NanoGPT baseline\.External Links:[Link](https://github.com/KellerJordan/modded-nanogpt)Cited by:[§B\.1](https://arxiv.org/html/2606.07597#A2.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.07597#S3.SS2.p1.1)\.
- K\. Jordan, Y\. Jin, V\. Boza, J\. You, F\. Cesista, L\. Newhouse, and J\. Bernstein \(2024b\)Muon: an optimizer for hidden layers in neural networks\.External Links:[Link](https://kellerjordan.github.io/posts/muon/)Cited by:[§B\.1](https://arxiv.org/html/2606.07597#A2.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.07597#S3.SS2.p1.1)\.
- P\. Joshi, S\. Santy, A\. Budhiraja, K\. Bali, and M\. Choudhury \(2020\)The state and fate of linguistic diversity and inclusion in the NLP world\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 6282–6293\.External Links:[Link](https://aclanthology.org/2020.acl-main.560/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.560)Cited by:[Limitations](https://arxiv.org/html/2606.07597#Sx1.p6.1)\.
- F\. Kang, Y\. Sun, B\. Wen, S\. Chen, D\. Song, R\. Mahmood, and R\. Jia \(2025\)AutoScale: scale\-aware data mixing for pre\-training LLMs\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=rujwIvjooA)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p3.1)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.CoRRabs/2001\.08361\.External Links:[Link](https://arxiv.org/abs/2001.08361),2001\.08361Cited by:[§2\.2](https://arxiv.org/html/2606.07597#S2.SS2.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p1.1)\.
- A\. Karpathy \(2022\)NanoGPT\.GitHub\.Note:[https://github\.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)Cited by:[§B\.1](https://arxiv.org/html/2606.07597#A2.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.07597#S3.SS2.p1.1)\.
- A\. Li, B\. Gong, B\. Yang, B\. Shan, C\. Liu, C\. Zhu, C\. Zhang, C\. Guo, D\. Chen, D\. Li, E\. Jiao, G\. Li, G\. Zhang, H\. Sun, H\. Dong, J\. Zhu, J\. Zhuang, J\. Song, J\. Zhu, J\. Han, J\. Li, J\. Xie, J\. Xu, J\. Yan, K\. Zhang, K\. Xiao, K\. Kang, L\. Han, L\. Wang, L\. Yu, L\. Feng, L\. Zheng, L\. Chai, L\. Xing, M\. Ju, M\. Chi, M\. Zhang, P\. Huang, P\. Niu, P\. Li, P\. Zhao, Q\. Yang, Q\. Xu, Q\. Wang, Q\. Wang, Q\. Li, R\. Leng, S\. Shi, S\. Yu, S\. Li, S\. Zhu, T\. Huang, T\. Liang, W\. Sun, W\. Sun, W\. Cheng, W\. Li, X\. Song, X\. Su, X\. Han, X\. Zhang, X\. Hou, X\. Min, X\. Zou, X\. Shen, Y\. Gong, Y\. Zhu, Y\. Zhou, Y\. Zhong, Y\. Hu, Y\. Fan, Y\. Yu, Y\. Yang, Y\. Li, Y\. Huang, Y\. Li, Y\. Huang, Y\. Xu, Y\. Mao, Z\. Li, Z\. Li, Z\. Tao, Z\. Ying, Z\. Cong, Z\. Qin, Z\. Fan, Z\. Yu, Z\. Jiang, and Z\. Wu \(2025\)MiniMax\-01: scaling foundation models with lightning attention\.CoRRabs/2501\.08313\.External Links:[Link](https://doi.org/10.48550/arXiv.2501.08313),[Document](https://dx.doi.org/10.48550/ARXIV.2501.08313),2501\.08313Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p3.1),[§2\.3](https://arxiv.org/html/2606.07597#S2.SS3.p2.1)\.
- F\. Liu, W\. Zhou, B\. Liu, Z\. Yu, Y\. Zhang, H\. Lin, Y\. Yu, B\. Zhang, X\. Zhou, T\. Wang, and Y\. Cao \(2025a\)QuaDMix: quality\-diversity balanced data selection for efficient llm pretraining\.External Links:2504\.16511,[Link](https://arxiv.org/abs/2504.16511)Cited by:[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1)\.
- Q\. Liu, X\. Zheng, N\. Muennighoff, G\. Zeng, L\. Dou, T\. Pang, J\. Jiang, and M\. Lin \(2025b\)RegMix: data mixture as regression for language model pre\-training\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=5BjQOUXq7i)Cited by:[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p3.1)\.
- S\. Longpre, G\. Yauney, E\. Reif, K\. Lee, A\. Roberts, B\. Zoph, D\. Zhou, J\. Wei, K\. Robinson, D\. Mimno, and D\. Ippolito \(2024\)A pretrainer’s guide to training data: measuring the effects of data age, domain coverage, quality, & toxicity\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 3245–3276\.External Links:[Link](https://aclanthology.org/2024.naacl-long.179/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.179)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1)\.
- I\. Magnusson, N\. Tai, B\. Bogin, D\. Heineman, J\. D\. Hwang, L\. Soldaini, A\. Bhagia, J\. Liu, D\. Groeneveld, O\. Tafjord, N\. A\. Smith, P\. W\. Koh, and J\. Dodge \(2025\)DataDecide: how to predict best pretraining data with small experiments\.InProceedings of the 42nd International Conference on Machine Learning,A\. Singh, M\. Fazel, D\. Hsu, S\. Lacoste\-Julien, F\. Berkenkamp, T\. Maharaj, K\. Wagstaff, and J\. Zhu \(Eds\.\),Proceedings of Machine Learning Research, Vol\.267,pp\. 42487–42502\.External Links:[Link](https://proceedings.mlr.press/v267/magnusson25a.html)Cited by:[Limitations](https://arxiv.org/html/2606.07597#Sx1.p2.1)\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2017\)Pointer sentinel mixture models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Byj72udxe)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.SSS0.Px1.p1.1)\.
- B\. Miranda, A\. Lee, S\. Sundar, A\. Casasola, R\. Schaeffer, E\. Obbad, and S\. Koyejo \(2025\)Beyond scale: the diversity coefficient as a data quality metric for variability in natural language data\.External Links:2306\.13840,[Link](https://arxiv.org/abs/2306.13840)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1)\.
- N\. Muennighoff, A\. M\. Rush, B\. Barak, T\. Le Scao, A\. Piktus, N\. Tazi, S\. Pyysalo, T\. Wolf, and C\. Raffel \(2023\)Scaling data\-constrained language models\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.External Links:[Link](https://dl.acm.org/doi/10.5555/3666122.3668313)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.07597#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.SSS0.Px1.p2.1),[§4\.1](https://arxiv.org/html/2606.07597#S4.SS1.SSS0.Px4.p2.1),[§4\.1](https://arxiv.org/html/2606.07597#S4.SS1.p2.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p4.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p5.1)\.
- National Center for Biotechnology Information \(NCBI\)\. PubMed\. External Links: [Link](https://pubmed.ncbi.nlm.nih.gov/) Cited by: [§1](https://arxiv.org/html/2606.07597#S1.p4.1), [§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.SSS0.Px2.p1.1)\.
- G\. Penedo, H\. Kydlíček, L\. B\. Allal, A\. Lozhkov, M\. Mitchell, C\. Raffel, L\. Von Werra, and T\. Wolf \(2024\)The FineWeb datasets: decanting the web for the finest text data at scale\.InProceedings of the 38th International Conference on Neural Information Processing Systems,NIPS ’24,Red Hook, NY, USA\.External Links:ISBN 9798331314385,[Link](https://dl.acm.org/doi/10.5555/3737916.3738886)Cited by:[§A\.3](https://arxiv.org/html/2606.07597#A1.SS3.p2.1),[§1](https://arxiv.org/html/2606.07597#S1.p4.1),[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.SSS0.Px3.p1.1)\.
- G\. Penedo, Q\. Malartic, D\. Hesslow, R\. Cojocaru, H\. Alobeidli, A\. Cappelli, B\. Pannier, E\. Almazrouei, and J\. Launay \(2023\)The RefinedWeb dataset for falcon LLM: outperforming curated corpora with web data only\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.External Links:[Link](https://dl.acm.org/doi/10.5555/3666122.3669586)Cited by:[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.SSS0.Px3.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.J\. Mach\. Learn\. Res\.21\(1\)\.External Links:ISSN 1532\-4435,[Link](https://dl.acm.org/doi/10.5555/3455716.3455856)Cited by:[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.SSS0.Px3.p1.1)\.
- M\. Shukor, L\. Bethune, D\. Busbridge, D\. Grangier, E\. Fini, A\. El\-Nouby, and P\. Ablin \(2025\)Scaling laws for optimal data mixtures\.InAdvances in Neural Information Processing Systems,D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\),Vol\.38,pp\. 129554–129579\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/bc1d640f841f752c689aae20b31198c1-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.07597#S2.SS3.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p3.1)\.
- J\. Su, M\. Ahmed, Y\. Lu, S\. Pan, W\. Bo, and Y\. Liu \(2024\)RoFormer: enhanced transformer with rotary position embedding\.Neurocomput\.568\(C\)\.External Links:ISSN 0925\-2312,[Link](https://doi.org/10.1016/j.neucom.2023.127063),[Document](https://dx.doi.org/10.1016/j.neucom.2023.127063)Cited by:[§B\.1](https://arxiv.org/html/2606.07597#A2.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.07597#S3.SS2.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, A\. Rodriguez, A\. Joulin, E\. Grave, and G\. Lample \(2023\)LLaMA: open and efficient foundation language models\.CoRRabs/2302\.13971\.External Links:[Link](https://doi.org/10.48550/arXiv.2302.13971),[Document](https://dx.doi.org/10.48550/ARXIV.2302.13971),2302\.13971Cited by:[§5](https://arxiv.org/html/2606.07597#S5.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p5.1)\.
- M\. Weber, D\. Y\. Fu, Q\. Anthony, Y\. Oren, S\. Adams, A\. Alexandrov, X\. Lyu, H\. Nguyen, X\. Yao, V\. Adams, B\. Athiwaratkun, R\. Chalamala, K\. Chen, M\. Ryabinin, T\. Dao, P\. Liang, C\. Ré, I\. Rish, and C\. Zhang \(2024\)RedPajama: an open dataset for training large language models\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/d34497330b1fd6530f7afd86d0df9f76-Abstract-Datasets%5C_and%5C_Benchmarks%5C_Track.html)Cited by:[Limitations](https://arxiv.org/html/2606.07597#Sx1.p5.1)\.
- S\. M\. Xie, H\. Pham, X\. Dong, N\. Du, H\. Liu, Y\. Lu, P\. Liang, Q\. V\. Le, T\. Ma, and A\. W\. Yu \(2023\)DoReMi: optimizing data mixtures speeds up language model pretraining\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.External Links:[Link](https://dl.acm.org/doi/10.5555/3666122.3669181)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.07597#S2.SS3.p1.1)\.
- F\. Xue, Y\. Fu, W\. Zhou, Z\. Zheng, and Y\. You \(2023\)To repeat or not to repeat: insights from scaling llm under token\-crisis\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.External Links:[Link](https://dl.acm.org/doi/10.5555/3666122.3668712)Cited by:[§2\.2](https://arxiv.org/html/2606.07597#S2.SS2.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p5.1)\.
- Y\. Yang, C\. Wang, and J\. Li \(2025\)UMoE: unifying attention and FFN with shared experts\.InAdvances in Neural Information Processing Systems,D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\),Vol\.38,pp\. 36988–37013\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/34bafcb1e8b7ee231c5a796e83d33f9b-Paper-Conference.pdf)Cited by:[§3\.1](https://arxiv.org/html/2606.07597#S3.SS1.p1.1)\.
- J\. Ye, P\. Liu, T\. Sun, J\. Zhan, Y\. Zhou, and X\. Qiu \(2025\)Data mixing laws: optimizing data mixtures by predicting language modeling performance\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=jjCB27TMK3)Cited by:[§1](https://arxiv.org/html/2606.07597#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.07597#S2.SS1.p1.1),[§2\.3](https://arxiv.org/html/2606.07597#S2.SS3.p1.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p3.1),[Limitations](https://arxiv.org/html/2606.07597#Sx1.p4.1)\.

## Appendix A Dataset Details

### A.1 WikiText

WikiText consists of full\-length articles, making it well\-suited for evaluating models on long\-range dependencies\. Performance on the validation set thus partly reflects the model’s ability to capture relationships across longer spans within a document\.

### A.2 PubMed

We select PubMed as a second high-quality domain for two reasons. First, the text comes from published academic articles that undergo rigorous review, so the samples are consistently well-written, comparable in quality to the Good and Featured WikiText articles. Second, PubMed is domain-specific: biomedical literature contains specialized terminology in physiology, medicine, and related fields that is rare in general discourse. This combination of high quality and domain specificity makes PubMed a useful complement to WikiText.

The Hugging Face PubMed dataset does not provide a predefined train/validation split, so we hold out approximately 1% of documents for validation. The full corpus contains 6,435,414,914 training tokens and 65,039,475 validation tokens. To maintain comparable dataset sizes with WikiText, we sample abstracts until the training set reaches approximately 120 million tokens (120,000,060), and construct a validation set of 200,191 tokens. As noted in the main paper, the number of validation tokens used in evaluation is fixed at 131,072.

### A.3 FineWeb

FineWeb is a large-scale dataset of 18.5 trillion tokens of cleaned and deduplicated web crawl data from Common Crawl ([https://commoncrawl.org](https://commoncrawl.org/)). We use the FineWeb-10BT subset, a random sample of approximately 10 billion tokens, as described in Section [3.1](https://arxiv.org/html/2606.07597#S3.SS1).

We chose FineWeb-10BT over alternatives such as FineWeb-Edu Penedo et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib57)) because the web crawl data in our experiments serves primarily as a source of regularization and generalizability. The more general FineWeb-10BT corpus aligns with this role and maintains a clear quality contrast with WikiText and PubMed.

## Appendix B Model Details

### B.1 NanoGPT

NanoGPT Karpathy ([2022](https://arxiv.org/html/2606.07597#bib.bib90)) provides a streamlined framework for training medium-sized GPT models. The modded-nanogpt repository Jordan et al. ([2024a](https://arxiv.org/html/2606.07597#bib.bib19)) hosts a speedrun challenge where practitioners train a language model to reach a target loss on FineWeb as quickly as possible.

We use a version that adds two modifications to the base GPT-2 architecture: the Muon optimizer Jordan et al. ([2024b](https://arxiv.org/html/2606.07597#bib.bib22)) and Rotary Positional Embeddings (RoPE) Su et al. ([2024](https://arxiv.org/html/2606.07597#bib.bib21)). Muon (MomentUm Orthogonalized by Newton-Schulz) applies a Newton-Schulz matrix iteration Bernstein and Newhouse ([2024](https://arxiv.org/html/2606.07597#bib.bib54)) to SGD-momentum updates. In our setup, Muon optimizes the two-dimensional weight matrices in the hidden layers, while AdamW handles the remaining parameters (embedding layer, final fully connected layer). RoPE encodes relative positions by rotating query and key vectors, improving training efficiency and robustness.
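To make the orthogonalization step concrete, the sketch below runs the textbook cubic Newton-Schulz iteration, X ← 1.5X − 0.5XXᵀX, on a small matrix in plain Python. It is an illustration under stated assumptions, not the optimizer used in the experiments: Muon implementations typically use a tuned higher-order polynomial and apply it to the momentum buffer, neither of which is reproduced here. The helper names (`matmul`, `transpose`, `newton_schulz_orthogonalize`) and the 2×2 input are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=10):
    """Approximate the orthogonal polar factor of a square, full-rank matrix g."""
    # Scale by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        cubic = matmul(matmul(x, transpose(x)), x)  # X X^T X
        x = [[1.5 * xv - 0.5 * cv for xv, cv in zip(x_row, c_row)]
             for x_row, c_row in zip(x, cubic)]
    return x

update = [[3.0, 1.0], [-1.0, 2.0]]   # hypothetical stand-in for a momentum matrix
q = newton_schulz_orthogonalize(update)
gram = matmul(q, transpose(q))       # close to the identity if q is orthogonal
```

Each step pushes every singular value of the scaled matrix toward 1 while leaving the singular vectors unchanged, so the result keeps the "direction" of the update but discards its scale.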

### B.2 Model Sizes

We use four model sizes, obtained by scaling the number of layers and embedding dimensions. The 124M model follows the original modded-nanogpt configuration (12 layers, 768-dimensional embeddings, 123,532,032 parameters). The 30M model scales both down by 50% (6 layers, 384 dimensions, 29,915,520 parameters). The 345M model scales both up by 50% (18 layers, 1152 dimensions, 344,550,528 parameters). The 757M model uses 24 layers and 1536-dimensional embeddings (756,672,000 parameters).
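As a sanity check on these figures, the sketch below shows one decomposition that reproduces all four parameter counts exactly: 12·d² weights per layer (4·d² for attention plus 8·d² for a 4×-wide MLP, no biases or normalization weights) plus a single vocabulary matrix of size V·d. The vocabulary size of 50,257 (the GPT-2 tokenizer) and the choice to count the vocabulary matrix once are assumptions inferred from the totals, not statements from the appendix.

```python
GPT2_VOCAB_SIZE = 50257  # assumption: GPT-2 BPE vocabulary, not stated in this appendix

def parameter_count(n_layers, d_model, vocab_size=GPT2_VOCAB_SIZE):
    """Count weights in a bias-free GPT-style stack with one vocabulary matrix."""
    # Per layer: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x-wide MLP.
    per_layer = 12 * d_model * d_model
    # Assumption: the vocabulary matrix is counted once.
    vocabulary = vocab_size * d_model
    return n_layers * per_layer + vocabulary

# (layers, embedding dimension) for each model size listed above.
configs = {"30M": (6, 384), "124M": (12, 768), "345M": (18, 1152), "757M": (24, 1536)}
counts = {name: parameter_count(layers, dim) for name, (layers, dim) in configs.items()}
```

Because the per-layer term grows with d² while the vocabulary term grows only with d, the vocabulary matrix accounts for most of the 30M model but only about a tenth of the 757M model.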

| High-Quality Dataset | Model Size | Training Tokens | Optimal Mixing Ratio |
|---|---|---|---|
| WikiText | 30M | 234M | (0.00, 1.00) |
| | | 468M | (0.05, 0.95) |
| | | 935M | (0.10, 0.90) |
| | | 1.87B | (0.25, 0.75) |
| | | 3.74B (Target) | (0.25, 0.75) |
| | 124M | 234M | (0.00, 1.00) |
| | | 468M | (0.25, 0.75) |
| | | 935M | (0.40, 0.60) |
| | | 1.87B | (0.55, 0.45) |
| | | 3.74B (Target) | (0.65, 0.35) |
| | 345M | 234M | (0.05, 0.95) |
| | | 468M | (0.35, 0.65) |
| | | 935M | (0.60, 0.40) |
| | | 1.87B | (0.70, 0.30) |
| | | 3.74B (Target) | (0.80, 0.20) |
| | 757M | 234M | (0.10, 0.90) |
| | | 468M | (0.40, 0.60) |
| | | 935M | (0.65, 0.35) |
| | | 1.87B | (0.75, 0.25) |
| | | 3.74B (Target) | (0.85, 0.15) |
| PubMed | 30M | 240M | (0.00, 1.00) |
| | | 480M | (0.05, 0.95) |
| | | 960M | (0.10, 0.90) |
| | | 1.92B | (0.20, 0.80) |
| | | 3.84B (Target) | (0.30, 0.70) |
| | 124M | 240M | (0.00, 1.00) |
| | | 480M | (0.15, 0.85) |
| | | 960M | (0.40, 0.60) |
| | | 1.92B | (0.55, 0.45) |
| | | 3.84B (Target) | (0.65, 0.35) |
| | 345M | 240M | (0.05, 0.95) |
| | | 480M | (0.30, 0.70) |
| | | 960M | (0.55, 0.45) |
| | | 1.92B | (0.70, 0.30) |
| | | 3.84B (Target) | (0.80, 0.20) |
| | 757M | 240M | (0.15, 0.85) |
| | | 480M | (0.40, 0.60) |
| | | 960M | (0.60, 0.40) |
| | | 1.92B | (0.75, 0.25) |
| | | 3.84B (Target) | (0.80, 0.20) |

Table 3: Optimal mixing ratios by high-quality dataset, model size, and training token budget for the two-source scaling laws experiments.

| High-Quality Dataset | Model Size | Training Tokens | Optimal Mixing Ratio |
|---|---|---|---|
| WikiText | 30M | 234M | (0.80, 0.20) |
| | | 468M | (0.75, 0.25) |
| | | 935M | (0.60, 0.40) |
| | | 1.87B | (0.50, 0.50) |
| | | 3.74B (Target) | (0.25, 0.75) |
| | 124M | 234M | (0.85, 0.15) |
| | | 468M | (0.85, 0.15) |
| | | 935M | (0.80, 0.20) |
| | | 1.87B | (0.75, 0.25) |
| | | 3.74B (Target) | (0.65, 0.35) |
| | 345M | 234M | (0.90, 0.10) |
| | | 468M | (0.90, 0.10) |
| | | 935M | (0.85, 0.15) |
| | | 1.87B | (0.85, 0.15) |
| | | 3.74B (Target) | (0.80, 0.20) |
| | 757M | 234M | (0.90, 0.10) |
| | | 468M | (0.90, 0.10) |
| | | 935M | (0.90, 0.10) |
| | | 1.87B | (0.85, 0.15) |
| | | 3.74B (Target) | (0.80, 0.20) |
| PubMed | 30M | 241M | (0.80, 0.20) |
| | | 481M | (0.70, 0.30) |
| | | 958M | (0.55, 0.45) |
| | | 1.92B | (0.45, 0.55) |
| | | 3.84B (Target) | (0.30, 0.70) |
| | 124M | 241M | (0.85, 0.15) |
| | | 481M | (0.85, 0.15) |
| | | 958M | (0.80, 0.20) |
| | | 1.92B | (0.75, 0.25) |
| | | 3.84B (Target) | (0.65, 0.35) |
| | 345M | 241M | (0.90, 0.10) |
| | | 481M | (0.90, 0.10) |
| | | 958M | (0.85, 0.15) |
| | | 1.92B | (0.85, 0.15) |
| | | 3.84B (Target) | (0.80, 0.20) |
| | 757M | 241M | (0.90, 0.10) |
| | | 481M | (0.90, 0.10) |
| | | 958M | (0.90, 0.10) |
| | | 1.92B | (0.85, 0.15) |
| | | 3.84B (Target) | (0.80, 0.20) |

Table 4: Optimal mixing ratios by high-quality dataset, model size, and training token budget for the two-source repeat-aware experiments.

## Appendix C Training Details

### C.1 Hyperparameters

We use a batch size of 128 and a sequence length of 256 to fit within GPU memory constraints. For the learning rate schedule, we apply a linear decay over the final $\frac{1800}{6200} \times (\text{number of iterations})$ steps, following the modded-nanogpt convention. For sweeps, we typically train with three learning rates, evenly spaced on a logarithmic scale. Exceptions occur for the 757M model and the longest horizons of the 124M and 345M models, where only a single learning rate is used due to resource constraints. Subsequent sweep ranges are informed by previous results; for example, if 0.00141 is optimal in [0.00141, 0.002, 0.00282], the next horizon may use [0.001, 0.00141, 0.002], reflecting the trend that optimal learning rates decrease for longer horizons. Across our sweeps, the optimal learning rate at a given horizon was largely stable across mixing ratios, indicating that the mixing-ratio and learning-rate axes are largely decoupled.
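The schedule and the sweep grid can be sketched as follows. This is a minimal illustration, assuming a constant learning rate before the cooldown, a decay that reaches zero at the final step, and no warmup; none of these details are stated above. The spacing factor of √2 is likewise inferred from the example grids, and the function names are hypothetical.

```python
import math

def lr_multiplier(step, num_iterations, cooldown_frac=1800 / 6200):
    """Scale factor for the base learning rate at a given step."""
    # Assumption: the cooldown length is rounded to a whole number of steps.
    cooldown_steps = max(1, int(round(cooldown_frac * num_iterations)))
    cooldown_start = num_iterations - cooldown_steps
    if step < cooldown_start:
        return 1.0  # assumption: constant before the cooldown, no warmup
    # Linear decay from 1 at cooldown_start to 0 at num_iterations (assumed endpoint).
    return (num_iterations - step) / cooldown_steps

def three_point_grid(center, factor=math.sqrt(2)):
    """Three learning rates evenly spaced on a log scale around center."""
    return [center / factor, center, center * factor]

grid = three_point_grid(0.002)        # close to [0.00141, 0.002, 0.00282]
next_grid = three_point_grid(grid[0])  # shifted down when the lowest rate wins
```

With 6,200 iterations the multiplier stays at 1.0 through step 4,400 and reaches 0.5 at step 5,300. Shifting the grid by one position, as in the last line, mirrors the example in the text where the next horizon is centered on the previous optimum.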

### C.2 Compute Resources

All experiments are conducted on NVIDIA A100\-SXM4\-80GB GPUs\. The random seed is fixed at 42 for both data pre\-processing and model training\.

### C.3 Mixing Ratios

The mixing ratio specifies the proportion of training tokens from each data source, enforced at the batch level\.

For each horizon, we sweep over mixing ratios in increments of 0.05 until the optimal ratio is identified. We occasionally deviate from 0.05 increments when an exploratory run at a coarser spacing clearly outperforms the previous ratio, in which case intermediate values would not meaningfully affect the search. We stop the sweep once a U-shaped curve in validation loss emerges, defined as a ratio that outperforms the ratios on either side. Across experiments, the validation loss consistently decreases monotonically until the optimum and then increases, confirming this as a reliable stopping criterion.
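The stopping rule can be written as a small check over the runs completed so far. The sketch below is one possible reading of the criterion, not the authors' sweep code; the function name is hypothetical. The loss values are the Table 13 rows with equal WikiText and PubMed shares, keyed by FineWeb proportion, and serve only as sample input.

```python
def find_u_shaped_optimum(losses_by_ratio):
    """Return the ratio that beats both evaluated neighbors, or None if no U has emerged."""
    ratios = sorted(losses_by_ratio)
    # Slide a window of three adjacent evaluated ratios across the sweep.
    for left, mid, right in zip(ratios, ratios[1:], ratios[2:]):
        if (losses_by_ratio[mid] < losses_by_ratio[left]
                and losses_by_ratio[mid] < losses_by_ratio[right]):
            return mid
    return None

# Losses are still decreasing after three runs, so the sweep continues.
sweep = {0.70: 3.09955, 0.75: 3.04690, 0.80: 3.03955}
after_three_runs = find_u_shaped_optimum(sweep)

# A fourth run with a higher loss closes the U and ends the sweep.
sweep[0.85] = 3.06200
after_four_runs = find_u_shaped_optimum(sweep)
```

Because the check uses whichever ratios have actually been evaluated as neighbors, it also covers sweeps that skip intermediate values at a coarser spacing.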

![Refer to caption](https://arxiv.org/html/2606.07597v1/x4.png)

Figure 4: Optimal number of PubMed repetitions across training horizons for the 30M, 124M, 345M, and 757M model experiments.

## Appendix D Additional Results

### D.1 Two-Source Mixtures

Tables [3](https://arxiv.org/html/2606.07597#A2.T3) and [4](https://arxiv.org/html/2606.07597#A2.T4) present the optimal mixing ratios for all scaling laws and repeat-aware experiments, respectively. Figure [4](https://arxiv.org/html/2606.07597#A3.F4) shows the optimal number of PubMed repetitions across model sizes, as discussed in Section [4.1](https://arxiv.org/html/2606.07597#S4.SS1).
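The link between mixing ratio, horizon, and repetition count can be illustrated with a short calculation. The sketch below assumes that the number of repetitions equals the high-quality share of the training tokens divided by the size of the high-quality dataset, and that a repetition-controlled run shrinks the high-quality dataset in proportion to the horizon; both are one reading of the procedure rather than the authors' code. The 0.2 high-quality share is a hypothetical value chosen for illustration, and the function names are hypothetical as well.

```python
PUBMED_TRAIN_TOKENS = 120_000_060   # PubMed training set size from Appendix A.2
TARGET_TOKENS = 3.84e9              # PubMed target horizon from Table 3
PROXY_TOKENS = 240e6                # shortest PubMed horizon (about 1/16 of the target)

def repetitions(hq_fraction, training_tokens, hq_dataset_tokens):
    """Passes over the high-quality dataset implied by a mixture and a horizon."""
    return hq_fraction * training_tokens / hq_dataset_tokens

def matched_subsample_tokens(hq_dataset_tokens, proxy_tokens, target_tokens):
    """Subsample size that keeps repetitions equal at a fixed mixing ratio."""
    return hq_dataset_tokens * proxy_tokens / target_tokens

hq_fraction = 0.2  # hypothetical high-quality share of training tokens

target_reps = repetitions(hq_fraction, TARGET_TOKENS, PUBMED_TRAIN_TOKENS)
uncontrolled_reps = repetitions(hq_fraction, PROXY_TOKENS, PUBMED_TRAIN_TOKENS)

subsample = matched_subsample_tokens(PUBMED_TRAIN_TOKENS, PROXY_TOKENS, TARGET_TOKENS)
controlled_reps = repetitions(hq_fraction, PROXY_TOKENS, subsample)
```

At this hypothetical share, the target run makes about 6.4 passes over PubMed, while the uncontrolled proxy run makes only about 0.4. The controlled proxy run on the subsample returns to about 6.4 passes, which is the mismatch the repeat-aware experiments are designed to remove.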

### D.2 Three-Source Mixtures

Mixing ratios are reported as proportions of FineWeb, WikiText, and PubMed, with the best result across learning rates for each ratio. The “Experiment Type” column indicates whether the ratio corresponds to a baseline, the tuning procedure, or a multi-horizon prediction. For the 124M model, Tables [6](https://arxiv.org/html/2606.07597#A5.T6), [7](https://arxiv.org/html/2606.07597#A5.T7), [8](https://arxiv.org/html/2606.07597#A5.T8), and [9](https://arxiv.org/html/2606.07597#A5.T9) present results for the 1/16, 1/8, 1/4, and 1/2 subsamples; Tables [11](https://arxiv.org/html/2606.07597#A5.T11), [12](https://arxiv.org/html/2606.07597#A5.T12), [13](https://arxiv.org/html/2606.07597#A5.T13), and [14](https://arxiv.org/html/2606.07597#A5.T14) present the corresponding results for the 757M model. Target-horizon results for both models are combined in Table [10](https://arxiv.org/html/2606.07597#A5.T10).

## Appendix E Compute Cost Details

| Horizons Used | Tokens per Run | % of Target |
|---|---|---|
| 1 | 232M | 6.2% |
| 2 | 700M | 18.7% |
| 3 | 1.64B | 43.7% |
| 4 | 3.50B | 93.7% |
| Target | 3.74B | 100% |

Table 5: Cumulative training tokens across horizons for the WikiText two-source experiments.

Table [5](https://arxiv.org/html/2606.07597#A5.T5) reports cumulative training tokens across horizons for the WikiText two-source experiments in absolute terms. Each horizon involves multiple training runs (one per mixing ratio and learning rate), so the total experimental cost at a given horizon is the token budget shown here multiplied by the number of runs. Since both methods use the same sweep procedure, this multiplier is the same for both, and the relative cost comparison holds.
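The cumulative figures follow from summing the per-horizon budgets. The sketch below assumes the four horizons are 1/16, 1/8, 1/4, and 1/2 of the 3.74B-token target, matching the subsample fractions used elsewhere in the appendix; the variable and function names are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate

TARGET_TOKENS = 3.74e9  # WikiText target horizon

# Assumption: each horizon is a fixed fraction of the target budget.
HORIZON_FRACTIONS = [Fraction(1, 16), Fraction(1, 8), Fraction(1, 4), Fraction(1, 2)]

def cumulative_budget_fractions(horizon_fractions):
    """Running total of the budget consumed when using the first k horizons."""
    return list(accumulate(horizon_fractions))

cumulative = cumulative_budget_fractions(HORIZON_FRACTIONS)
cumulative_tokens = [float(f) * TARGET_TOKENS for f in cumulative]
```

Under this assumption the running totals are 1/16, 3/16, 7/16, and 15/16 of the target, that is 6.25%, 18.75%, 43.75%, and 93.75%, in line with the percentages in Table 5 up to rounding.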

#### Two\-source setting\.

A single repetition-controlled horizon at 6% of the target budget can replace a multi-horizon scaling laws analysis consuming 44 to 94% of it, since both methods perform the same number of runs per horizon. This advantage would scale with the target training budget if the per-horizon proportions held: at a 1 trillion token target, the same 6% vs 94% split would correspond to roughly 62 billion tokens per run for a single repetition-controlled horizon, against around 940 billion for a four-horizon scaling laws analysis. Whether these proportions hold at this scale is an empirical question beyond the range of our experiments.

#### Three\-source setting\.

At the 757M scale, two repetition-controlled horizons (roughly 19% of the target budget) recover the optimal mixture at sweep granularity, matching or outperforming baselines whose construction requires the full two-source experiments. At the 124M scale, repetition-controlled predictions from two to four horizons all outperform both baselines from two-source results, with the four-horizon prediction effectively reaching the optimum (loss 2.91950 vs. 2.91820), though at substantially higher cost. Even at this smaller scale, two horizons suffice to beat both baselines, suggesting that repetition control still substantially reduces the experimental budget needed when more data sources are involved.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.7, 0.15, 0.15 | 0.00141 | 3.52295 | Baseline 1 |
| **0.75, 0.125, 0.125** | **0.00141** | **3.50460** | **Tuned** |
| 0.75, 0.1, 0.15 | 0.00141 | 3.51945 | Tuned |
| 0.75, 0.15, 0.1 | 0.00141 | 3.51950 | Tuned |
| 0.8, 0.1, 0.1 | 0.00141 | 3.51725 | Tuned |
| 0.85, 0.075, 0.075 | 0.002 | 3.54890 | Baseline 2 |
| 0.9, 0.05, 0.05 | 0.00141 | 3.62840 | Tuned |

Table 6: Three-source repeat-aware results with a 1/16 subsample for the 124M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.45, 0.275, 0.275 | 0.00141 | 3.66295 | Tuned |
| 0.5, 0.25, 0.25 | 0.00141 | 3.53470 | Tuned |
| 0.55, 0.225, 0.275 | 0.00141 | 3.44795 | Tuned |
| 0.6, 0.2, 0.2 | 0.00141 | 3.38440 | Tuned |
| 0.65, 0.175, 0.175 | 0.00141 | 3.34305 | Tuned |
| **0.7, 0.15, 0.15** | **0.00141** | **3.32235** | **Baseline 1** |
| 0.7, 0.2, 0.1 | 0.00141 | 3.35140 | Tuned |
| 0.7, 0.1, 0.2 | 0.00141 | 3.37140 | Tuned |
| 0.75, 0.125, 0.125 | 0.00141 | 3.33015 | Tuned |
| 0.85, 0.075, 0.075 | 0.00141 | 3.38965 | Baseline 2 |

Table 7: Three-source repeat-aware results with a 1/8 subsample for the 124M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.5, 0.25, 0.25 | 0.00141 | 3.23315 | Tuned |
| 0.6, 0.2, 0.2 | 0.00141 | 3.17525 | Baseline 1 |
| **0.65, 0.175, 0.175** | **0.00141** | **3.16845** | **Tuned** |
| 0.7, 0.15, 0.15 | 0.00141 | 3.16900 | Tuned |
| 0.7, 0.2, 0.1 | 0.00141 | 3.18480 | Tuned |
| 0.7, 0.1, 0.2 | 0.00141 | 3.20185 | Tuned |
| 0.75, 0.125, 0.125 | 0.00141 | 3.19185 | Tuned |
| 0.8, 0.1, 0.1 | 0.00141 | 3.21795 | Baseline 2 |

Table 8: Three-source repeat-aware results with a 1/4 subsample for the 124M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.45, 0.275, 0.275 | 0.001 | 3.05655 | Tuned |
| 0.5, 0.25, 0.25 | 0.001 | 3.03700 | Baseline 1 |
| **0.55, 0.225, 0.225** | **0.001** | **3.03345** | **Tuned** |
| 0.55, 0.275, 0.175 | 0.001 | 3.04290 | Tuned |
| 0.55, 0.175, 0.275 | 0.001 | 3.05045 | Tuned |
| 0.6, 0.2, 0.2 | 0.001 | 3.03565 | Tuned |
| 0.75, 0.125, 0.125 | 0.001 | 3.08405 | Baseline 2 |

Table 9: Three-source repeat-aware results with a 1/2 subsample for the 124M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Model | Mixing Ratio | Learning Rate | Average Validation Loss | Experiment Type |
|---|---|---|---|---|
| 124M | 0.3, 0.35, 0.35 | 0.001 | 2.94270 | Baseline 1 |
| 124M | 0.4, 0.3, 0.3 | 0.001 | 2.92300 | Tuned |
| 124M | **0.45, 0.25, 0.3** | **0.001** | **2.91820** | **Tuned** |
| 124M | 0.45, 0.3, 0.25 | 0.001 | 2.91915 | Tuned |
| 124M | 0.45, 0.275, 0.275 | 0.001 | 2.91935 | Tuned |
| 124M | 0.5, 0.25, 0.25 | 0.001 | 2.92115 | Tuned |
| 124M | 0.51, 0.245, 0.245 | 0.001 | 2.91950 | Four-Horizon Prediction |
| 124M | 0.56, 0.22, 0.22 | 0.001 | 2.92830 | Three-Horizon Prediction |
| 124M | 0.57, 0.215, 0.215 | 0.001 | 2.92965 | Two-Horizon Prediction |
| 124M | 0.6, 0.2, 0.2 | 0.00141 | 2.94150 | Tuned |
| 124M | 0.65, 0.175, 0.175 | 0.001 | 2.95570 | Baseline 2 |
| 124M | 0.75, 0.125, 0.125 | 0.00141 | 3.01300 | Tuned |
| 757M | 0.55, 0.225, 0.225 | 0.001 | 2.81550 | Tuned |
| 757M | 0.6, 0.2, 0.2 | 0.001 | 2.78805 | Tuned |
| 757M | **0.65, 0.175, 0.175** | **0.001** | **2.76990** | **Two-Horizon Prediction** |
| 757M | 0.65, 0.15, 0.2 | 0.001 | 2.77510 | Baseline 1 |
| 757M | 0.7, 0.15, 0.15 | 0.001 | 2.77015 | Tuned |
| 757M | 0.7, 0.2, 0.1 | 0.001 | 2.78765 | Tuned |
| 757M | 0.7, 0.1, 0.2 | 0.001 | 2.80405 | Tuned |
| 757M | 0.75, 0.125, 0.125 | 0.001 | 2.78580 | Tuned |
| 757M | 0.825, 0.075, 0.1 | 0.001 | 2.83365 | Baseline 2 |
| 757M | 0.85, 0.075, 0.075 | 0.001 | 2.85175 | Single-Horizon Prediction |

Table 10: Three-source results at the full training horizon for the 124M and 757M models. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run per model in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.7, 0.15, 0.15 | 0.001 | 3.68130 | Tuned |
| 0.75, 0.125, 0.125 | 0.001 | 3.48265 | Tuned |
| 0.8, 0.1, 0.1 | 0.001 | 3.39890 | Tuned |
| **0.85, 0.075, 0.075** | **0.001** | **3.38515** | **Tuned** |
| 0.85, 0.1, 0.05 | 0.001 | 3.40070 | Tuned |
| 0.85, 0.05, 0.1 | 0.001 | 3.41555 | Tuned |
| 0.9, 0.05, 0.05 | 0.001 | 3.44425 | Tuned |

Table 11: Three-source repeat-aware results with a 1/16 subsample for the 757M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.5, 0.25, 0.25 | 0.001 | 4.49960 | Tuned |
| 0.6, 0.2, 0.2 | 0.001 | 3.86945 | Tuned |
| 0.7, 0.15, 0.15 | 0.001 | 3.37725 | Tuned |
| **0.8, 0.1, 0.1** | **0.001** | **3.20075** | **Tuned** |
| 0.8, 0.15, 0.05 | 0.001 | 3.30435 | Tuned |
| 0.8, 0.05, 0.15 | 0.001 | 3.31390 | Tuned |
| 0.85, 0.075, 0.075 | 0.001 | 3.20810 | Tuned |
| 0.9, 0.05, 0.05 | 0.001 | 3.27125 | Tuned |

Table 12: Three-source repeat-aware results with a 1/8 subsample for the 757M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.7, 0.15, 0.15 | 0.001 | 3.09955 | Tuned |
| 0.75, 0.125, 0.125 | 0.001 | 3.04690 | Tuned |
| **0.8, 0.1, 0.1** | **0.001** | **3.03955** | **Tuned** |
| 0.8, 0.15, 0.05 | 0.001 | 3.09650 | Tuned |
| 0.8, 0.05, 0.15 | 0.001 | 3.11155 | Tuned |
| 0.85, 0.075, 0.075 | 0.001 | 3.06200 | Tuned |

Table 13: Three-source repeat-aware results with a 1/4 subsample for the 757M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.

| Mixing Ratio | Learning Rate | Avg. Validation Loss | Experiment Type |
|---|---|---|---|
| 0.65, 0.175, 0.175 | 0.001 | 2.93330 | Tuned |
| 0.7, 0.15, 0.15 | 0.001 | 2.90015 | Tuned |
| **0.75, 0.125, 0.125** | **0.001** | **2.89195** | **Tuned** |
| 0.75, 0.175, 0.075 | 0.001 | 2.93055 | Tuned |
| 0.75, 0.075, 0.175 | 0.001 | 2.94015 | Tuned |
| 0.8, 0.1, 0.1 | 0.001 | 2.90955 | Tuned |

Table 14: Three-source repeat-aware results with a 1/2 subsample for the 757M model. Mixing ratios are proportions of FineWeb, WikiText, and PubMed. Best run in bold.
