Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

arXiv cs.CL 05/18/26, 04:00 AM Papers
data-mixing language-models pretraining continual-learning low-rank-adapters on-policy efficient-training
Summary
This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.
arXiv:2605.15220v1 Announce Type: new
# Efficient and Simple Data Mixing All The Time
Source: [https://arxiv.org/html/2605.15220](https://arxiv.org/html/2605.15220)
## Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Michael Y. Hu¹, Apurva Gandhi², Kyunghyun Cho¹, Tal Linzen¹, Pratyusha Sharma¹,³

¹New York University, ²Carnegie Mellon University, ³Microsoft

{michael.hu, kyunghyun.cho, linzen}@nyu.edu, apurvag@andrew.cmu.edu, pratysharma@microsoft.com

###### Abstract

Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. *We argue that data mixing is fundamentally an online decision making problem, one that recurs throughout training and demands a single, unified solution.* We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.

![Refer to caption](https://arxiv.org/html/2605.15220v1/x1.png) Figure 1: Overview of OP-Mix. OP-Mix aims to cheaply estimate optimal data mixing ratios in a continual setting. (1) Train a lightweight LoRA adapter on new domains to estimate future performance. (2) Interpolate adapters to simulate different data mixtures without retraining and then estimate the optimal mixture ratio. (3) Train the base model with the computed mixture.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/efficiency_figure.png) Figure 2: OP-Mix (purple) Pareto-dominates the performance-efficiency frontier when tested alongside baselines across pretraining, continual midtraining, and continual instruction tuning.

## 1 Introduction

Language models are trained on carefully curated data mixtures, yet the science of constructing the right mixture remains nascent. The dominant approach, training small proxy models on candidate mixtures and extrapolating to full training, is combinatorially expensive and scales poorly as the number of domains grows (Ye et al., [2025](https://arxiv.org/html/2605.15220#bib.bib2); Liu et al., [2025](https://arxiv.org/html/2605.15220#bib.bib26); Chen et al., [2026](https://arxiv.org/html/2605.15220#bib.bib1)). Furthermore, most data mixing approaches are specialized towards pretraining and assume a fixed domain set (Chen et al., [2025](https://arxiv.org/html/2605.15220#bib.bib29); Fan et al., [2024](https://arxiv.org/html/2605.15220#bib.bib27); Jiang et al., [2025](https://arxiv.org/html/2605.15220#bib.bib53); Xie et al., [2023](https://arxiv.org/html/2605.15220#bib.bib33); Chen et al., [2023](https://arxiv.org/html/2605.15220#bib.bib23)). In practice, however, available training domains evolve continuously as new tasks are defined, new corpora are collected, and new capabilities are prioritized. This induces a natural continual learning problem, where the goal is to incorporate new data without catastrophically forgetting what the model has already learned. We ask:

*What is the right data mix, and how do we find it efficiently as the data itself keeps changing?*

We propose OP-Mix (On-Policy Mix), an algorithm that estimates optimal data mixtures by combining two insights. First, rather than train separate proxy models for each candidate data mixture, OP-Mix trains a single low-rank adapter (LoRA, Hu et al. ([2022](https://arxiv.org/html/2605.15220#bib.bib60))) per data domain directly from the current model, keeping the proxy model *on-policy* with the model being trained, i.e., reflective of its current state. Second, it uses linear interpolation between LoRAs as a proxy for the loss surface of full data mixing, following recent works (Wang et al., [2026](https://arxiv.org/html/2605.15220#bib.bib59); Tao et al., [2025](https://arxiv.org/html/2605.15220#bib.bib61)). This bypasses the need to retrain proxies for every different data mix ratio, escaping the combinatorial explosion of training runs. These two insights allow OP-Mix to search over data mixtures with minimal additional compute, no separate proxy models, and natural accommodation to new domains: when a new dataset arrives, we simply train another LoRA and re-fit the mixture.

We evaluate OP-Mix across three stages of the language model lifecycle: pretraining (Radford et al., [2019](https://arxiv.org/html/2605.15220#bib.bib9); Devlin et al., [2019](https://arxiv.org/html/2605.15220#bib.bib10)), continual midtraining (OLMo et al., [2025](https://arxiv.org/html/2605.15220#bib.bib21); Liu et al., [2026](https://arxiv.org/html/2605.15220#bib.bib12)), and continual instruction tuning (Wei et al., [2022](https://arxiv.org/html/2605.15220#bib.bib11)). We find that our single algorithm suffices for all three. In pretraining, OP-Mix improves over no data mixing by 6.3% in average perplexity and matches the best data mixing baseline's performance while using 14% less compute. In continual midtraining, OP-Mix achieves the performance of full retraining at a fraction of the cost. Finally, in continual instruction tuning, OP-Mix composes with on-policy self-distillation (Shenfeld et al., [2026](https://arxiv.org/html/2605.15220#bib.bib57); Lu, [2025](https://arxiv.org/html/2605.15220#bib.bib58); Zhao et al., [2026](https://arxiv.org/html/2605.15220#bib.bib45)), yielding further gains without modifications to either algorithm. Our contributions are as follows:

1. The first universal data mixing algorithm: OP-Mix is the first data mixing algorithm that both expands to new data domains and simulates candidate mixtures without separate proxy models. This enables OP-Mix to continually mix data even as the data evolves, overcoming the need for a different algorithm at each phase of the training pipeline (§[3](https://arxiv.org/html/2605.15220#S3)).
2. State-of-the-art across the entire training lifecycle: A single instantiation of OP-Mix achieves state-of-the-art performance in pretraining, continual midtraining, and continual instruction tuning, demonstrating that phase-specific algorithms are unnecessary (§[4](https://arxiv.org/html/2605.15220#S4)).
3. OP-Mix enables continual learning, matching on-policy distillation with 95% less compute: Applied atop standard SFT during continual instruction tuning, OP-Mix recovers the gains of self-distillation finetuning (SDFT, Shenfeld et al. ([2026](https://arxiv.org/html/2605.15220#bib.bib57))) at a fraction of the cost. Combining OP-Mix with SDFT also yields further gains, suggesting that data mixing can be an axis of improvement independent of the training objective (§[4.1](https://arxiv.org/html/2605.15220#S4.SS1)).

## 2 Background: Data Mixing and Its Limitations

Let $\mathcal{D} = \{D_1, D_2, \dots, D_m\}$ be a set of $m$ data domains, where domain $D_i$ has $N_i$ tokens. A data mixture is a probability vector $p \in \triangle^{m-1}$, where training on $R$ total tokens uses $p_i \cdot R$ tokens from domain $D_i$. We denote a language model of $S$ parameters trained for $R$ tokens on mixture $p$ as $\text{LM}(S, R, p)$ and measure its performance on downstream task $j \in [J]$ as $f_j(\text{LM}(S, R, p))$. We assume the training objective is to minimize a weighted sum $F = \sum_j w_j \cdot f_j(\text{LM}(S, R, p))$, where weights $w_j$ are user-specified. Here, metrics intended to be maximized (e.g., accuracy) are negated.
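
As a small illustration of this notation, the sketch below splits a token budget $R$ across domains according to a mixture $p$ and evaluates the weighted objective $F$, negating metrics that should be maximized. The function names and all numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def tokens_per_domain(p, total_tokens):
    """Split a token budget R across domains: domain i receives p_i * R tokens."""
    p = np.asarray(p, dtype=float)
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("p must lie on the probability simplex")
    return p * total_tokens

def weighted_objective(metrics, weights, maximize=None):
    """F = sum_j w_j * f_j, with metrics flagged in `maximize` negated first."""
    metrics = np.asarray(metrics, dtype=float)
    weights = np.asarray(weights, dtype=float)
    sign = np.ones_like(metrics)
    if maximize is not None:
        sign[np.asarray(maximize, dtype=bool)] = -1.0
    return float(np.sum(weights * sign * metrics))

# Hypothetical mixture over three domains and a 1M-token budget.
alloc = tokens_per_domain([0.5, 0.3, 0.2], total_tokens=1_000_000)

# Hypothetical metrics: a loss of 2.0 (minimized) and an accuracy of 0.75 (maximized).
F = weighted_objective([2.0, 0.75], weights=[1.0, 1.0], maximize=[False, True])
```

With these toy values, the accuracy term enters $F$ with a negative sign, so lowering $F$ corresponds to lowering the loss or raising the accuracy.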

#### Batch continual learning\.

During training, we may periodically receive $k$ new datasets $D_{m+1}, \dots, D_{m+k}$, in which case the updated domain set becomes $\mathcal{D} \cup \{D_{m+1}, \dots, D_{m+k}\}$. For example, these $k$ new datasets may be instruction fine-tuning datasets, introduced after pretraining. We may then aim to minimize the loss across both pretraining and instruction tuning datasets.

#### Data mixing\.

Data mixing algorithms automate the process of finding the mixture $p$ that minimizes $F$. The core idea in most data mixing algorithms is to fit a simple model $\hat{f}_i(p)$ that predicts the future performance $f_i$ as a function of the data mixture $p$ (see Chen et al. ([2025](https://arxiv.org/html/2605.15220#bib.bib29)) for a review). One can then minimize $\hat{f}_i(p)$ to estimate an optimal mixture.

Table 1: OP-Mix is the only method that expands the data mixture to new data while not using separate proxy models. The combination of these two features allows OP-Mix to be deployed across the language model lifecycle.

Previous work has shown that future performance is well-predicted by a log-linear parametric form: $\hat{f}_i(p) = c_i + \exp(A_i^\top p)$, where $c_i \in \mathbb{R}$ and $A_i \in \mathbb{R}^m$ (Ye et al., [2025](https://arxiv.org/html/2605.15220#bib.bib2); Chen et al., [2025](https://arxiv.org/html/2605.15220#bib.bib29), [2026](https://arxiv.org/html/2605.15220#bib.bib1)). Data mixing algorithms then aim to estimate such a scaling law as cheaply as possible. A common technique is to fit the scaling law by randomly sampling mixtures from the probability simplex and training proxy models with fewer parameters $S' \ll S$ and less data $R' \ll R$ to approximate the full model's performance on a data mixture: $f_i(\text{LM}(S, R, p)) \approx f_i(\text{LM}(S', R', p))$ (Liu et al., [2025](https://arxiv.org/html/2605.15220#bib.bib26); Chen et al., [2026](https://arxiv.org/html/2605.15220#bib.bib1); Ye et al., [2025](https://arxiv.org/html/2605.15220#bib.bib2)).
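
To make the log-linear form concrete, here is a minimal sketch of fitting $c + \exp(A^\top p)$ to observed (mixture, loss) pairs. The fitting procedure (scan candidate offsets $c$, then solve ordinary least squares for $A$ on $\log(y - c)$) is an assumption made for illustration and is not taken from the cited works; the data are synthetic and noise-free.

```python
import numpy as np

def fit_log_linear(P, y, c_grid):
    """Fit y ≈ c + exp(P @ A): scan offsets c, solve least squares for A at each."""
    P = np.asarray(P, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for c in c_grid:
        if c >= y.min():
            continue  # log(y - c) would be undefined for this offset
        A, *_ = np.linalg.lstsq(P, np.log(y - c), rcond=None)
        err = float(np.mean((c + np.exp(P @ A) - y) ** 2))
        if best is None or err < best[0]:
            best = (err, float(c), A)
    if best is None:
        raise ValueError("every candidate offset c is >= min(y)")
    return best[1], best[2]

def predict(c, A, p):
    """Evaluate the fitted law at a mixture p."""
    return c + np.exp(np.asarray(p, dtype=float) @ A)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=40)        # random mixtures on the simplex
true_c, true_A = 1.5, np.array([0.2, -0.8, 0.5])
y = true_c + np.exp(P @ true_A)               # synthetic "losses", no noise

c_hat, A_hat = fit_log_linear(P, y, c_grid=np.linspace(0.0, 1.9, 20))
```

Because mixtures sum to one, a constant shift of all entries of $A$ acts as a multiplicative factor on the exponential, so the sketch needs no separate scale parameter.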

A single data mixing algorithm that spans all of LM training is desirable for both practical reasons (less complexity and less phase-specific tuning) and conceptual ones: pretraining, midtraining, and finetuning are not fundamentally different problems for data mixing. However, two issues limit existing algorithms from operating across this lifecycle. First, most data mixing algorithms, being targeted towards pretraining, do not expand their data mixtures. These algorithms therefore cannot be applied to the continual learning setting, even though language model training induces a natural continual learning problem from phase to phase. Second, data mixing methods that rely on separate proxy models are no longer usable after pretraining, as open-source model releases typically do not come with a matching small-model proxy (Team et al., [2025](https://arxiv.org/html/2605.15220#bib.bib20); Grattafiori et al., [2024](https://arxiv.org/html/2605.15220#bib.bib24); Yang et al., [2025](https://arxiv.org/html/2605.15220#bib.bib22)). Moreover, separate smaller proxy models have been shown to yield suboptimal mixtures for the target model, as they diverge from the base model's dynamics at scale (Jiang et al., [2025](https://arxiv.org/html/2605.15220#bib.bib53); Chen et al., [2026](https://arxiv.org/html/2605.15220#bib.bib1)), and the number of proxies explodes combinatorially with the number of datasets.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/clm_dynamics_ord1_norm_530M_plain.png) Figure 3: OP-Mix enables continual learning: For a 530M parameter model, OP-Mix mitigates forgetting 27% better on average and 71% better on Reddit than Continual SFT with the WSD-S learning rate schedule, a method specifically designed for continual learning (Wen et al., [2025](https://arxiv.org/html/2605.15220#bib.bib47)).

## 3 OP-Mix: On-Policy Data Mixing

In this work, we propose OP-Mix, a data mixing algorithm that works effectively for any stage of language model training by using on-policy proxies instead of separate proxies and efficiently expanding data mixtures. *On-policy* here means that the proxy is built from the model being trained, rather than a separately initialized model whose learning dynamics may diverge from its target. OP-Mix uses Low-Rank Adaptation (LoRA, Hu et al. ([2022](https://arxiv.org/html/2605.15220#bib.bib60))) to cheaply estimate the performance of full training. LoRA reduces the compute needed to test new data mixtures, stays tied to the base model, and circumvents the ambiguities of creating separate proxy models later in training.

To simulate new data mixtures without performing additional training runs, we interpolate LoRA weights, as inspired by Wang et al. ([2026](https://arxiv.org/html/2605.15220#bib.bib59)) and Tao et al. ([2025](https://arxiv.org/html/2605.15220#bib.bib61)). This allows us to train one LoRA per data domain and estimate the effect of mixing domains post-hoc using only forward passes. We also expand the data mixture when new domains arrive, taking inspiration from pretraining mixture reuse in Chen et al. ([2026](https://arxiv.org/html/2605.15220#bib.bib1)). In each stage, instead of retraining a new LoRA for *every* previously seen domain, we train a single "old" adapter $\theta^{\text{LoRA}}_{D_{\text{old}}}$, keeping probabilities of old domains constant and only adjusting the ratio between the old mixture and incoming new domains.
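
The interpolation idea can be sketched with plain arrays. The snippet below assumes each adapter is a dictionary mapping a layer name to a pair of low-rank factors `(B, A)` and that interpolation is applied to the materialized updates `B @ A`; whether the paper interpolates the factors or the products is not stated in this section, so this is an assumption, and all names are hypothetical.

```python
import numpy as np

def lora_delta(adapter):
    """Materialize the weight update B @ A for every adapted layer."""
    return {name: B @ A for name, (B, A) in adapter.items()}

def interpolate_adapters(adapters, alphas):
    """Convex combination of adapter updates; alphas must lie on the simplex."""
    alphas = np.asarray(alphas, dtype=float)
    if len(adapters) != len(alphas) or np.any(alphas < 0) or not np.isclose(alphas.sum(), 1.0):
        raise ValueError("need one non-negative weight per adapter, summing to 1")
    deltas = [lora_delta(a) for a in adapters]
    return {name: sum(w * d[name] for w, d in zip(alphas, deltas)) for name in deltas[0]}

def apply_delta(base_weights, delta):
    """Add the merged update to the frozen base weights (layers without an update are unchanged)."""
    return {name: W + delta.get(name, 0.0) for name, W in base_weights.items()}

rng = np.random.default_rng(0)
d, r = 8, 2                                    # toy hidden size and LoRA rank
base = {"layer0": rng.normal(size=(d, d))}
old = {"layer0": (rng.normal(size=(d, r)), rng.normal(size=(r, d)))}
new = {"layer0": (rng.normal(size=(d, r)), rng.normal(size=(r, d)))}

# Simulate a 70/30 split between old data and the new domain without any training.
merged = apply_delta(base, interpolate_adapters([old, new], [0.7, 0.3]))
```

Each choice of weights yields a candidate model that can be scored with forward passes only, which is what makes sweeping many mixtures cheap.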

#### OP-Mix (Algorithm [1](https://arxiv.org/html/2605.15220#alg1)).

In the *continual setting*, when $K$ new domains $D_{m+i}$, $i \in [K]$, arrive, we train a single LoRA adapter per domain $D_{m+i}$, starting from the current model. This gives us $\theta^{\text{LoRA}}_{D_{m+i}}$, a cheap approximation of what full finetuning on $D_{m+i}$ would produce. We also train $\theta^{\text{LoRA}}_{D_{\text{old}}}$ on the old data to approximate continued training on $D_{\text{old}}$. Next, we evaluate linear interpolation merges of $\theta^{\text{LoRA}}_{D_{m+1}}, \dots, \theta^{\text{LoRA}}_{D_{m+K}}$ and $\theta^{\text{LoRA}}_{D_{\text{old}}}$. We sample interpolation points in the $K$-simplex $\triangle^{K}$, and each interpolation point simulates a different mixing ratio between old and new data without additional training. We then fit a regression model to these evaluations, producing a smooth loss surface over the interpolation path (Algorithm [1](https://arxiv.org/html/2605.15220#alg1), lines 7–12). Finally, we minimize over this surface to obtain $\alpha^{\star}$, the tradeoff between old and new data, distribute the resulting weight across all datasets, and do the final training run. See Figure [1](https://arxiv.org/html/2605.15220#S0.F1) for a visual overview.

For *pretraining*, we begin with a warmup phase in which every document is sampled with equal probability (empirical risk minimization). After warmup, we reintroduce each dataset as a new domain to adjust the data mixture. In §[4.1](https://arxiv.org/html/2605.15220#S4.SS1), we set warmup to 20% of the overall token budget.

Algorithm 1: OP-Mix (single continual learning step)

1. Input: Base model $\theta_{\text{base}}$; previous domains $\{D_1, \dots, D_m\}$ with mixture $p_{t-1} \in \triangle^{m-1}$; new domains $\{D_{m+1}, \dots, D_{m+K}\}$; mixture prior $\mu \in \triangle^{m+K-1}$; search iterations $P$; regularization strength towards prior $\lambda$.
2. Train LoRA adapter $\theta^{\text{LoRA}}_{\text{old}}$ on the mixture $p_{t-1}$, starting from $\theta_{\text{base}}$.
3. for $k \in [K]$ do
4. Train LoRA adapter $\theta^{\text{LoRA}}_{D_{m+k}}$ on $D_{m+k}$, starting from $\theta_{\text{base}}$.
5. end for
6. Define the mixture expansion $E: \triangle^{K} \to \triangle^{m+K-1}$ by $E(\boldsymbol{\alpha})_i = \alpha_{\text{old}}\, p_{t-1}(D_i)$ if $i \leq m$ and $E(\boldsymbol{\alpha})_i = \alpha_i$ if $i > m$, where $\boldsymbol{\alpha} = (\alpha_{\text{old}}, \alpha_{m+1}, \dots, \alpha_{m+K})$.
7. for $p \in [P]$ do
8. Sample $\boldsymbol{\alpha}_p \sim \triangle^{K}$.
9. Form the merged adapter $\theta^{\text{LoRA}}_{\boldsymbol{\alpha}_p} \leftarrow \alpha_{\text{old}}\, \theta^{\text{LoRA}}_{\text{old}} + \sum_{k=1}^{K} \alpha_{m+k}\, \theta^{\text{LoRA}}_{D_{m+k}}$.
10. Evaluate per-domain loss $y_{p,j} = f_j\bigl(\theta^{\text{LoRA}}_{\boldsymbol{\alpha}_p}\bigr)$ for $j = 1, \dots, N$.
11. end for
12. Fit log-linear regressors $\hat{g}_j(\boldsymbol{\alpha})$ to $\{(\boldsymbol{\alpha}_p, y_{p,j})\}_{p=1}^{P}$.
13. Solve the regularized mix optimization with cvxpy (Diamond and Boyd, [2016](https://arxiv.org/html/2605.15220#bib.bib51)): $\boldsymbol{\alpha}^{\star} = \arg\min_{\boldsymbol{\alpha} \in \triangle^{K}} \frac{1}{N} \sum_{j=1}^{N} \hat{g}_j(\boldsymbol{\alpha}) + \lambda\, D_{\text{KL}}\bigl(E(\boldsymbol{\alpha}) \,\big\|\, \mu\bigr)$.
14. Set the new mixture $p_t \leftarrow E(\boldsymbol{\alpha}^{\star})$ and continue training $\theta_{\text{base}}$ on $p_t$.
15. return $p_t$ and the fine-tuned model (the next stage's $\theta_{\text{base}}$).
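
The mixture expansion and the regularized selection step of Algorithm 1 can be sketched as follows. This is a simplified stand-in: it replaces the cvxpy solve with random search over Dirichlet samples, and `surrogate_loss` is a hypothetical placeholder for the averaged fitted regressors. All function names and numbers are assumptions made for illustration.

```python
import numpy as np

def expand_mixture(alpha, p_prev):
    """E(alpha): scale the old mixture by alpha_old, then append the new-domain weights."""
    alpha = np.asarray(alpha, dtype=float)
    return np.concatenate([alpha[0] * np.asarray(p_prev, dtype=float), alpha[1:]])

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with clipping to avoid log(0)."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def choose_mixture(surrogate_loss, p_prev, prior, num_new, lam=0.1,
                   num_candidates=5000, seed=0):
    """Pick alpha minimizing surrogate_loss(alpha) + lam * KL(E(alpha) || prior).

    Random search over the simplex stands in for the convex solver here.
    """
    rng = np.random.default_rng(seed)
    candidates = rng.dirichlet(np.ones(num_new + 1), size=num_candidates)
    scores = [surrogate_loss(a) + lam * kl_divergence(expand_mixture(a, p_prev), prior)
              for a in candidates]
    best = candidates[int(np.argmin(scores))]
    return best, expand_mixture(best, p_prev)

# Toy surrogate whose minimum puts 25% of the weight on the single new domain.
def toy_loss(alpha):
    return (alpha[1] - 0.25) ** 2

alpha_star, p_t = choose_mixture(toy_loss, p_prev=[0.6, 0.4], prior=[1 / 3] * 3,
                                 num_new=1, lam=0.0)
```

Note that the expansion keeps the relative proportions of the old domains fixed (here 0.6 to 0.4), so the search only runs over the old-versus-new tradeoff rather than over every domain.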

## 4 Experimental Results

We examine OP-Mix in several settings: pretraining, continual midtraining, and continual instruction fine-tuning. In pretraining, we examine OP-Mix's ability to find a good pretraining mixture for fixed training corpora. In midtraining (OLMo et al., [2025](https://arxiv.org/html/2605.15220#bib.bib21)), high quality datasets are upweighted relative to the original pretraining data mixture; here we continually finetune a pretrained model ladder from HuggingFace on successive reference datasets. Finally, we apply OP-Mix to the continual instruction tuning setting, where a language model is finetuned on successive question-answering datasets, and consider two different objectives: cross-entropy loss and on-policy distillation (Shenfeld et al., [2026](https://arxiv.org/html/2605.15220#bib.bib57); Lu, [2025](https://arxiv.org/html/2605.15220#bib.bib58); Zhao et al., [2026](https://arxiv.org/html/2605.15220#bib.bib45)). Further training details and hyperparameter choices are in Appendix [A](https://arxiv.org/html/2605.15220#A1).

#### Pretraining baselines\.

ERM samples from each data domain with probability proportional to domain size, equivalent to not optimizing the data mixture. MergeMix (Wang et al., [2026](https://arxiv.org/html/2605.15220#bib.bib59)) finetunes independent models on each dataset, merges to simulate mixing, and uses regression to estimate the optimal mixture; we adapt it to pretraining with a 20% ERM warmup before finetuning (see Appendix [A.2](https://arxiv.org/html/2605.15220#A1.SS2) for more details). It is a natural comparison: essentially OP-Mix without data mixture expansion (Algorithm [1](https://arxiv.org/html/2605.15220#alg1), line 6) and with full finetuning in place of LoRA. OLMix (Chen et al., [2026](https://arxiv.org/html/2605.15220#bib.bib1)) trains small proxy models on randomly sampled mixtures over datasets and uses regression to estimate the optimal mixture.

#### Continual learning baselines\.

Continual fine-tuning with WSD-S (Wen et al., [2025](https://arxiv.org/html/2605.15220#bib.bib47)) trains on each dataset in succession with no replay of old data, using Warmup-Stable-Decay-Simplified, a learning rate schedule designed for continual learning. For simplicity, we use WSD-S for all methods, including OP-Mix. 10% data replay extends the observation of Béthune et al. ([2025](https://arxiv.org/html/2605.15220#bib.bib49)) that having 10% of finetuning data be pretraining data mitigates catastrophic forgetting; we train with a 1:9 ratio between old and new data. Retraining (skyline): After training for $K \cdot R$ tokens from $K$ datasets and receiving a $(K+1)$th dataset, train again for $(K+1) \cdot R$ tokens over all $K+1$ datasets.
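
To see why the retraining skyline becomes expensive, the arithmetic below compares cumulative training tokens under two simplifying assumptions that are ours, not the paper's: retraining is redone at every dataset arrival, and sequential training adds exactly $R$ tokens per stage with any mixture-search overhead ignored. The value of `R` is hypothetical.

```python
def sequential_tokens(num_datasets, tokens_per_stage):
    """Total tokens when each arriving dataset triggers one stage of R tokens."""
    return num_datasets * tokens_per_stage

def cumulative_retraining_tokens(num_datasets, tokens_per_stage):
    """Total tokens when stage k retrains from scratch for k * R tokens (k = 1..K)."""
    return sum(k * tokens_per_stage for k in range(1, num_datasets + 1))

R = 1_000_000_000  # hypothetical per-stage token budget
K = 5              # number of datasets received so far

seq = sequential_tokens(K, R)                 # K * R
retrain = cumulative_retraining_tokens(K, R)  # R * K * (K + 1) / 2
```

Under these assumptions the sequential total grows linearly in the number of datasets while the cumulative retraining total grows quadratically.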

To our knowledge, there currently are no adaptive data mixing baselines for continual learning\. As noted in §[2](https://arxiv.org/html/2605.15220#S2), existing data mixing methods either operate over fixed data domains or require separate proxy models initialized from scratch\.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/pretraining_summary_subset_grid.png) Figure 4: Pretraining: OP-Mix outperforms empirical risk minimization (grey line), which samples each data domain in proportion to its size, and beats or matches the performance of other data mixing baselines while being up to 14% more efficient (Figure [2](https://arxiv.org/html/2605.15220#S0.F2)).
### 4.1 OP-Mix Works Across the Language Model Lifecycle

Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix achieves state-of-the-art performance (Figures [4](https://arxiv.org/html/2605.15220#S4.F4)–[6](https://arxiv.org/html/2605.15220#S4.F6)).

#### Pretraining (Figure [4](https://arxiv.org/html/2605.15220#S4.F4)).

We pretrain three different model sizes (150M, 300M, and 530M) from the OLMo model ladder (Groeneveld et al., [2024](https://arxiv.org/html/2605.15220#bib.bib32)) to Chinchilla-optimal (Hoffmann et al., [2022](https://arxiv.org/html/2605.15220#bib.bib56)) token counts of 3.2B, 6.5B, and 10.5B, respectively. We construct the pretraining data from 5 data domains: Algebraic Stack, ArXiv, c4, Reddit, and StackExchange (Raffel et al., [2020](https://arxiv.org/html/2605.15220#bib.bib31); Weber et al., [2024](https://arxiv.org/html/2605.15220#bib.bib54)). During evaluation, we measure perplexity on all data domains and compute overall perplexity by a simple unweighted average. Each data domain contains more than 10.5B tokens, so no data mixture trains for more than one epoch on any data domain.

In pretraining, OP-Mix matches the performance of MergeMix using up to 14% less compute (Figure [2](https://arxiv.org/html/2605.15220#S0.F2)A) and outperforms OLMix by 5–6% at every scale. This is consistent with our on-policy hypothesis: OP-Mix and MergeMix both build proxies from the model being trained, while OLMix uses a separate proxy whose learning dynamics diverge from the base model. These results are roughly mirrored by the downstream evaluations in Appendix Table [2](https://arxiv.org/html/2605.15220#A0.T2), where OP-Mix is either best or second-best in downstream task performance. Overall, OP-Mix consistently outperforms ERM in perplexity and downstream evaluations while being more efficient than MergeMix.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/clm_summary_subset_grid.png) Figure 5: Continual midtraining: OP-Mix outperforms other continual learning baselines and is even competitive with full retraining (grey line), despite training on the datasets sequentially.
#### Continual Midtraining (Figure [5](https://arxiv.org/html/2605.15220#S4.F5)).

In continual midtraining, the user receives a stream of new data domains, emulating the real-life scenario where one updates a base model using new reference datasets. Here, our data mixture must expand with each new dataset. In this section, we finetune open-source LMs pretrained on C4 (Raffel et al., [2020](https://arxiv.org/html/2605.15220#bib.bib31)) from the DataDecide model suite (Magnusson et al., [2025](https://arxiv.org/html/2605.15220#bib.bib55)). We continually finetune models of parameter counts 150M, 300M, and 530M on Algebraic Stack, ArXiv, Open Web Math, Reddit, and StackExchange in alphabetic order. To account for ordering effects, we cyclically permute the order of the datasets so that each dataset appears once in every order position and train on all five resulting orderings (see Table [6](https://arxiv.org/html/2605.15220#A1.T6) in the Appendix).
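
The cyclic permutation scheme can be illustrated in a few lines. This sketch only shows how cyclic shifts of the alphabetical order place every dataset in every position exactly once; the exact orderings used in the experiments are those listed in the paper's Table 6, which is not reproduced here.

```python
def cyclic_orders(items):
    """Return all cyclic shifts of a list: shift i starts at items[i]."""
    return [items[i:] + items[:i] for i in range(len(items))]

datasets = ["Algebraic Stack", "ArXiv", "Open Web Math", "Reddit", "StackExchange"]
orders = cyclic_orders(datasets)

for order in orders:
    print(" -> ".join(order))
```

Five datasets give five orderings, far fewer than the 120 full permutations, while still balancing position effects across datasets.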

During continual midtraining, continual SFT suffers severe catastrophic forgetting (Figure [3](https://arxiv.org/html/2605.15220#S2.F3)), and of the data-mixing methods, OP-Mix is best at mitigating it, nearly matching the performance of full retraining (Figure [9](https://arxiv.org/html/2605.15220#A0.F9)) while using up to 66% less compute. In Figure [10](https://arxiv.org/html/2605.15220#A0.F10) (Appendix), we also include an ablation, LoRA-Merge, that merges the trained LoRAs into the base model using the optimized $\alpha^{\star}$ instead of running full finetuning. Although better than Continual SFT, LoRA-Merge is significantly worse than OP-Mix, indicating that there are benefits to using LoRA only as a proxy.

#### Continual Instruction Tuning (Figure [6](https://arxiv.org/html/2605.15220#S4.F6)).

We take the following continual learning task and ordering verbatim from Shenfeld et al. ([2026](https://arxiv.org/html/2605.15220#bib.bib57)): we continually finetune Qwen2.5-7B-Instruct (Yang et al., [2024](https://arxiv.org/html/2605.15220#bib.bib30)) on three instruction-following domains introduced one at a time: Tool Use (4k examples), Science (1.2k examples), and Medical (10k examples). In addition to standard SFT, we test Self-Distillation Finetuning (SDFT) (Shenfeld et al., [2026](https://arxiv.org/html/2605.15220#bib.bib57)) as the training objective. Performance is measured by mean accuracy across domains.

OP-Mix on top of standard SFT (60.0%) matches the performance of SDFT (60.2%) while using 95% less compute, demonstrating that data mixing alone can recover the gains of a more sophisticated continual learning algorithm. (SDFT uses more compute as it repeatedly generates and distills on its own training data.) The two methods are also synergistic: combining OP-Mix with SDFT achieves the best overall performance (61.9%), suggesting that data mixing and objective modifications are orthogonal axes of improvement.

### 4.2 Efficiency: OP-Mix Pareto-Dominates on the Performance-Efficiency Frontier

Figure [2](https://arxiv.org/html/2605.15220#S0.F2) compares methods by final performance versus total training FLOPs, counting both mixture selection and final training. OP-Mix Pareto-dominates across pretraining, continual midtraining, and continual instruction tuning: no baseline achieves better performance at lower compute. OP-Mix's reuse of mixtures especially matters in continual midtraining (Figure [2](https://arxiv.org/html/2605.15220#S0.F2)B), where the cost of naive retraining grows with every new domain. Unlike retraining, OP-Mix turns adaptive data mixing into a lightweight operation that can be repeated whenever new data arrives.
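
For readers less familiar with the term, the sketch below shows one common way to check Pareto dominance over (compute, loss) pairs, where lower is better on both axes. The method names and numbers are hypothetical placeholders and are not results from the paper.

```python
def dominates(a, b):
    """a dominates b if it is no worse on both axes and strictly better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(points):
    """Names of all points that no other point dominates."""
    return {
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    }

# Hypothetical (relative compute, loss) pairs, for illustration only.
runs = {
    "method_a": (1.0, 2.90),
    "method_b": (1.2, 2.90),   # same loss as method_a at higher compute
    "method_c": (3.0, 2.85),   # best loss, but the most compute
    "method_d": (2.0, 3.20),   # worse than method_a on both axes
}

front = pareto_front(runs)
```

In this toy example `method_a` and `method_c` form the frontier, since each trades compute against loss in a way the other cannot beat on both axes.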

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/sdft_summary.png) Figure 6: Continual instruction tuning: OP-Mix works across cross-entropy and KL distillation objectives, improving the performance of both supervised finetuning and self-distillation finetuning.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/opm_optimal_mixture_2x3.png) Figure 7: OP-Mix (purple) closely tracks the true data mixing loss surface (red), which we obtain by running full finetuning to completion at each mixture, for different model sizes.

## 5 Analysis

### 5.1 OP-Mix Reliably and Efficiently Estimates Optimal Data Mixtures

We ask whether OP-Mix consistently recovers good mixing weights. In Figure [7](https://arxiv.org/html/2605.15220#S4.F7), we plot the true loss surface with respect to mixture proportions in red and the estimated loss surface from OP-Mix in purple. We generate the true loss surface by running the full training at each set of proportions, as opposed to training a proxy. We find that merging LoRAs closely tracks the true data mixing loss surface. More concretely, in Figure [8](https://arxiv.org/html/2605.15220#A0.F8) in the Appendix, we sweep for the best mixture at each stage in the continual midtraining setting and find that the loss at OP-Mix's chosen proportions is on average 0.9% above the loss at the optimal proportions, compared to 2.9% for the fixed 10% data replay baseline.

### 5\.2 Theoretical Analysis: Formalizing OP\-Mix’s Sources of Error

We now analyze the conditions under which OP\-Mix recovers an optimal interpolation weight, and bound the suboptimality of its predicted weight relative to the optimal weight\. This section formalizes the intuition that OP\-Mix’s error is small if \(1\) LoRA performance is a good approximation of full fine\-tuning performance and \(2\) linear interpolation is a good approximation for mixing\.

#### Setup\.

Suppose we receive a new domain $D_{m+1}$\. Let $F(\alpha)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\theta^{\text{train}}(\alpha))$ denote the average evaluation performance of a model trained on the mixture assigning weight $\alpha\in[0,1]$ to $D_{m+1}$ and distributing $1-\alpha$ over previous domains $D_{\text{old}}$\. OP\-Mix constructs a proxy for $F$ via two approximations: rather than training a full model on $D_{m+1}$, OP\-Mix trains two LoRA adapters $\theta^{\text{LoRA}}_{D_{m+1}}$ and $\theta^{\text{LoRA}}_{D_{\text{old}}}$; and rather than training on multiple data mixtures, OP\-Mix evaluates linear interpolations between the two LoRA adapters, yielding the proxy loss $\hat{F}(\alpha)=\frac{1}{N}\sum_{i=1}^{N}f_{i}\!\left(\theta^{\text{LoRA}}(\alpha)\right)$\.
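As a rough illustration of how such a proxy surface could be evaluated, the sketch below interpolates two adapter updates on a single weight matrix and averages toy evaluation losses over a grid of $\alpha$ values\. Several things here are assumptions rather than the paper's implementation: the interpolation is applied to the dense updates $BA$ rather than to the low\-rank factors, the evaluation functions are synthetic quadratics, and the names `lora_delta`, `interpolate`, and `proxy_surface` are hypothetical\.

```python
import numpy as np


def lora_delta(B: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Dense weight update implied by a LoRA pair (B is d x r, A is r x k)."""
    return B @ A


def interpolate(W0, delta_old, delta_new, alpha):
    """Base weights plus a linear interpolation of the two adapter updates."""
    return W0 + (1.0 - alpha) * delta_old + alpha * delta_new


def proxy_surface(W0, delta_old, delta_new, eval_fns, alphas):
    """F_hat(alpha): mean evaluation loss of the interpolated model at each alpha."""
    return np.array([
        np.mean([f(interpolate(W0, delta_old, delta_new, a)) for f in eval_fns])
        for a in alphas
    ])


rng = np.random.default_rng(0)
d, k, r = 6, 5, 2
W0 = rng.normal(size=(d, k))
delta_old = lora_delta(rng.normal(size=(d, r)), rng.normal(size=(r, k)))
delta_new = lora_delta(rng.normal(size=(d, r)), rng.normal(size=(r, k)))

# Toy evaluation losses: squared distance to each adapter's own solution.
target_old, target_new = W0 + delta_old, W0 + delta_new
eval_fns = [
    lambda W: float(np.sum((W - target_old) ** 2)),
    lambda W: float(np.sum((W - target_new) ** 2)),
]

alphas = np.linspace(0.0, 1.0, 11)
f_hat = proxy_surface(W0, delta_old, delta_new, eval_fns, alphas)
alpha_hat = float(alphas[np.argmin(f_hat)])
```

With these symmetric toy losses the proxy surface is minimized at the midpoint\. No training happens inside the sweep: each grid point costs only a weight merge and an evaluation pass\.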

To isolate the contributions of these two approximations, we follow Chen et al\. \([2026](https://arxiv.org/html/2605.15220#bib.bib1)\) and define an intermediate surface $F^{M}(\alpha)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\theta^{\text{full}}(\alpha))$, where $\theta^{\text{full}}(\alpha)=(1-\alpha)\cdot\theta_{D_{\text{old}}}+\alpha\cdot\theta_{D_{m+1}}$ interpolates full finetuning updates instead of LoRAs\. The two errors are then:

$$\varepsilon_{\text{merge}}:=\sup_{\alpha\in[0,1]}\left|F(\alpha)-F^{M}(\alpha)\right|,\quad\varepsilon_{\text{LoRA}}:=\sup_{\alpha\in[0,1]}\left|F^{M}(\alpha)-\hat{F}(\alpha)\right|.$$

If both approximations are exact \($\varepsilon_{\mathrm{merge}}=\varepsilon_{\mathrm{LoRA}}=0$\), then OP\-Mix returns an optimal interpolation weight\. When the approximations are imperfect, the following bound holds:

###### Proof sketch\.

See Appendix [B](https://arxiv.org/html/2605.15220#A2) for the full proof\. The following holds:

$$J(\hat{\alpha})-J(\alpha^{\star})\;=\;\underbrace{\big[J(\hat{\alpha})-\hat{J}(\hat{\alpha})\big]}_{\leq\,\varepsilon_{\mathrm{merge}}+\varepsilon_{\mathrm{LoRA}}}\;+\;\underbrace{\big[\hat{J}(\hat{\alpha})-\hat{J}(\alpha^{\star})\big]}_{\leq\;0}\;+\;\underbrace{\big[\hat{J}(\alpha^{\star})-J(\alpha^{\star})\big]}_{\leq\,\varepsilon_{\mathrm{merge}}+\varepsilon_{\mathrm{LoRA}}}.$$

The middle term is nonpositive because $\hat{\alpha}$ minimizes $\hat{J}$\. For the other two terms, the regularization terms cancel and the triangle inequality through $F^{M}$ gives $|F(\alpha)-\hat{F}(\alpha)|\leq\varepsilon_{\mathrm{merge}}+\varepsilon_{\mathrm{LoRA}}$\. Summing the three terms bounds the suboptimality by $2(\varepsilon_{\mathrm{merge}}+\varepsilon_{\mathrm{LoRA}})$\. ∎
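The decomposition can be sanity\-checked numerically on a grid\. The sketch below is only an illustration under stated assumptions: the three surfaces are synthetic, the suprema are approximated by maxima over the grid, and $J$ and $\hat{J}$ are taken to be $F$ and $\hat{F}$ plus a shared, hypothetical regularizer `reg`, which is how the proof sketch's remark that the regularization terms cancel is interpreted here\.

```python
import numpy as np


def sup_gap(f, g) -> float:
    """Grid approximation of sup_alpha |f(alpha) - g(alpha)|."""
    return float(np.max(np.abs(np.asarray(f) - np.asarray(g))))


alphas = np.linspace(0.0, 1.0, 101)

# Synthetic surfaces (assumptions, not measurements from the paper).
F = 3.0 + (alphas - 0.4) ** 2              # true mixing surface
F_M = F + 0.02 * np.sin(6.0 * alphas)      # surface from merging full finetunes
F_hat = F_M + 0.01 * np.cos(9.0 * alphas)  # surface from merging LoRA adapters

reg = 0.05 * alphas                        # hypothetical regularizer shared by J and J_hat
J, J_hat = F + reg, F_hat + reg

eps_merge = sup_gap(F, F_M)
eps_lora = sup_gap(F_M, F_hat)

i_hat = int(np.argmin(J_hat))              # weight selected using the proxy objective
i_star = int(np.argmin(J))                 # weight that is optimal for the true objective
suboptimality = float(J[i_hat] - J[i_star])
bound = 2.0 * (eps_merge + eps_lora)

print(f"suboptimality={suboptimality:.4f}, bound={bound:.4f}")
```

The bound is typically loose: it depends on the worst\-case gap anywhere on $[0,1]$, whereas the realized suboptimality depends only on how far the proxy displaces the minimizer\.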

We verify empirically in Figure [8](https://arxiv.org/html/2605.15220#A0.F8) that the overall approximation gap is small and non\-increasing across continual learning stages\. Furthermore, $\varepsilon_{\text{merge}}$ being small is empirically supported by linear mode connectivity \(Frankle et al\., [2020](https://arxiv.org/html/2605.15220#bib.bib3); Wortsman et al\., [2022](https://arxiv.org/html/2605.15220#bib.bib5)\), which observes that linear interpolation does not incur large loss spikes when the finetuned models share a base model \(see Corollary [B\.2](https://arxiv.org/html/2605.15220#A2.Thmlemma2)\)\.

## 6Related Work and Discussion

#### Continual learning\.

The core challenge in continual learning is catastrophic forgetting, where training on new data degrades performance on previously learned tasks \(McCloskey and Cohen, [1989](https://arxiv.org/html/2605.15220#bib.bib40)\)\. Approaches to mitigating forgetting fall into three broad families: regularization\-based methods constrain parameter updates to protect knowledge from earlier tasks \(Kirkpatrick et al\., [2017](https://arxiv.org/html/2605.15220#bib.bib39); Aljundi et al\., [2018](https://arxiv.org/html/2605.15220#bib.bib38)\); replay\-based methods retain or regenerate examples from previous tasks \(Rolnick et al\., [2019](https://arxiv.org/html/2605.15220#bib.bib37); Shin et al\., [2017](https://arxiv.org/html/2605.15220#bib.bib36)\); and architecture\-based methods allocate new capacity for new tasks \(Rusu et al\., [2022](https://arxiv.org/html/2605.15220#bib.bib43); Wang et al\., [2023](https://arxiv.org/html/2605.15220#bib.bib44)\)\. OP\-Mix is a replay\-based method\.

In the context of LLMs, forgetting can manifest across pretraining, instruction tuning, and alignment stages \(Shi et al\., [2025](https://arxiv.org/html/2605.15220#bib.bib41); Zheng et al\., [2025](https://arxiv.org/html/2605.15220#bib.bib42)\)\. Our work considers in\-weights learning, which updates model parameters\. A parallel line of work keeps LLM weights frozen and accumulates knowledge in\-context, e\.g\., via soft prompts \(Razdaibiedina et al\., [2023](https://arxiv.org/html/2605.15220#bib.bib6)\) or modular KV\-cache cartridges \(Eyuboglu et al\., [2026](https://arxiv.org/html/2605.15220#bib.bib25)\)\. However, the storage of in\-context approaches grows with the dataset size, so amortizing knowledge from context to parameters using in\-weights learning remains relevant\.

#### Data mixing\.

Ye et al\. \([2025](https://arxiv.org/html/2605.15220#bib.bib2)\) established scaling laws for data mixtures, showing that downstream performance is a predictable function of mixture proportions\. This empirically grounded the pipeline of training small proxy models on candidate mixtures and fitting a regression model to extrapolate to full scale \(Liu et al\., [2025](https://arxiv.org/html/2605.15220#bib.bib26); Chen et al\., [2026](https://arxiv.org/html/2605.15220#bib.bib1)\)\. In contrast to this offline approach, several online data mixing algorithms have been proposed for pretraining based on distributionally robust optimization, including DoReMi \(Xie et al\., [2023](https://arxiv.org/html/2605.15220#bib.bib33)\), DoGE \(Fan et al\., [2024](https://arxiv.org/html/2605.15220#bib.bib27)\), and GRAPE \(Fan et al\., [2025](https://arxiv.org/html/2605.15220#bib.bib28)\)\. Chen et al\. \([2025](https://arxiv.org/html/2605.15220#bib.bib29)\) showed that both classes of algorithms are instances of the same linear framework\.

#### Limitations and Future Work\.

Our experiments top out at 530M parameters for pretraining and midtraining and 7B for instruction tuning, leaving open how OP\-Mix behaves at frontier scale \(70B\+ parameters\)\. We also do not characterize how the LoRA proxy and model merging behave as the number of domains grows to 10 or 100\. Future work can extend OP\-Mix to reward\-based objectives like RLHF \(Ouyang et al\., [2022](https://arxiv.org/html/2605.15220#bib.bib50)\), where LoRA tends to work well \(Schulman, [2025](https://arxiv.org/html/2605.15220#bib.bib52)\)\. More broadly, our lifecycle\-unification direction suggests that other training decisions, such as learning rate schedules or training objectives, may admit similarly unified formulations\.

#### Conclusion\.

The various phases of language model training are artificial divisions, and data mixing algorithms should work gracefully across them in a continual setting\. Existing methods fall short on two fronts: they cannot incorporate new datasets, and they rely on off\-policy proxy models\. We address both limitations with OP\-Mix, the first data mixing algorithm to achieve state\-of\-the\-art results across pretraining, continual midtraining, and continual instruction tuning\. By exploiting LoRA and linear mode connectivity to cheaply simulate candidate mixtures, OP\-Mix turns adaptive data mixing into a lightweight operation that can be repeated whenever new data arrives\.

## Acknowledgments

We thank Graham Neubig, John Langford, Maxime Peyrard, Sebastian Cygert, Mayee Chen, and the NYU Computation and Psycholinguistics Lab for discussion and feedback\. MYH is supported by the NSF Graduate Research Fellowship\. AG is supported by the Amazon AI PhD Fellowship\.

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise\. This work was supported by the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) with a grant funded by the Ministry of Science and ICT \(MSIT\) of the Republic of Korea in connection with the Global AI Frontier Lab International Collaborative Research\. This work was also supported by the Samsung Advanced Institute of Technology \(under the project Next Generation Deep Learning: From Pattern Recognition to AI\) and the National Science Foundation \(under NSF Awards 1922658, IIS\-2239862, and IIS\-2433429\)\.

## References

- R\. Aljundi, F\. Babiloni, M\. Elhoseiny, M\. Rohrbach, and T\. Tuytelaars \(2018\)Memory aware synapses: learning what \(not\) to forget\.InComputer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part III,Berlin, Heidelberg,pp\. 144–161\.External Links:ISBN 978\-3\-030\-01218\-2,[Link](https://doi.org/10.1007/978-3-030-01219-9_9),[Document](https://dx.doi.org/10.1007/978-3-030-01219-9%5F9)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- L\. Béthune, D\. Grangier, D\. Busbridge, E\. Gualdoni, marco cuturi, and P\. Ablin \(2025\)Scaling laws for forgetting during finetuning with pretraining data injection\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=vWMij23BmQ)Cited by:[§4](https://arxiv.org/html/2605.15220#S4.SS0.SSS0.Px2.p1.5)\.
- Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi \(2020\)PIQA: reasoning about physical commonsense in natural language\.InThirty\-Fourth AAAI Conference on Artificial Intelligence,Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- M\. F\. Chen, M\. Y\. Hu, N\. Lourie, K\. Cho, and C\. Re \(2025\)Aioli: a unified optimization framework for language model data mixing\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=sZGZJhaNSe)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p1.6),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p2.6),[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.3.2.2),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- M\. F\. Chen, T\. Murray, D\. Heineman, M\. Jordan, H\. Hajishirzi, C\. Ré, L\. Soldaini, and K\. Lo \(2026\)Olmix: a framework for data mixing throughout lm development\.arXiv preprint arXiv:2602\.12237\.Cited by:[§B\.1](https://arxiv.org/html/2605.15220#A2.SS1.p1.1),[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p2.6),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p3.1),[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.6.5.2),[§3](https://arxiv.org/html/2605.15220#S3.p2.1),[§4](https://arxiv.org/html/2605.15220#S4.SS0.SSS0.Px1.p1.1),[§5\.2](https://arxiv.org/html/2605.15220#S5.SS2.SSS0.Px1.p2.2),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- M\. F\. Chen, N\. Roberts, K\. Bhatia, J\. Wang, C\. Zhang, F\. Sala, and C\. Ré \(2023\)Skill\-it\! a data\-driven skills framework for understanding and training language models\.InProceedings of the 37th International Conference on Neural Information Processing Systems,NIPS ’23,Red Hook, NY, USA\.Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1)\.
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019\)BoolQ: exploring the surprising difficulty of natural yes/no questions\.InNAACL,Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.ArXivabs/1803\.05457\.Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.External Links:1810\.04805,[Link](https://arxiv.org/abs/1810.04805)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1)\.
- S\. Diamond and S\. Boyd \(2016\)CVXPY: A Python\-embedded modeling language for convex optimization\.Journal of Machine Learning Research17\(83\),pp\. 1–5\.Cited by:[Table 3](https://arxiv.org/html/2605.15220#A1.T3.4.4.1),[13](https://arxiv.org/html/2605.15220#alg1.l13.1)\.
- S\. Eyuboglu, R\. S\. Ehrlich, S\. Arora, N\. Guha, D\. Zinsley, E\. R\. Liu, A\. Rudra, J\. Zou, A\. Mirhoseini, and C\. Re \(2026\)Cartridges: lightweight and general\-purpose long context representations via self\-study\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=0k5w8O0SNg)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p2.1)\.
- S\. Fan, M\. I\. Glarou, and M\. Jaggi \(2025\)GRAPE: optimize data mixture for group robust multi\-target adaptive pretraining\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=JRmIvBcnWc)Cited by:[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.5.4.2),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- S\. Fan, M\. Pagliardini, and M\. Jaggi \(2024\)DOGE: domain reweighting with generalization estimation\.InForty\-first International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=7rfZ6bMZq4)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.4.3.2),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- J\. Frankle, G\. K\. Dziugaite, D\. M\. Roy, and M\. Carbin \(2020\)Linear mode connectivity and the lottery ticket hypothesis\.InProceedings of the 37th International Conference on Machine Learning,ICML’20\.Cited by:[§5\.2](https://arxiv.org/html/2605.15220#S5.SS2.SSS0.Px1.p4.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian,et al\.\(2024\)The llama 3 herd of models\.External Links:2407\.21783,[Link](https://arxiv.org/abs/2407.21783)Cited by:[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p3.1)\.
- D\. Groeneveld, I\. Beltagy, E\. Walsh, A\. Bhagia, R\. Kinney, O\. Tafjord, A\. Jha, H\. Ivison, I\. Magnusson, Y\. Wang, S\. Arora, D\. Atkinson, R\. Authur, K\. Chandu, A\. Cohan, J\. Dumas, Y\. Elazar, Y\. Gu, J\. Hessel, T\. Khot, W\. Merrill, J\. Morrison, N\. Muennighoff, A\. Naik, C\. Nam, M\. Peters, V\. Pyatkin, A\. Ravichander, D\. Schwenk, S\. Shah, W\. Smith, E\. Strubell, N\. Subramani, M\. Wortsman, P\. Dasigi, N\. Lambert, K\. Richardson, L\. Zettlemoyer, J\. Dodge, K\. Lo, L\. Soldaini, N\. Smith, and H\. Hajishirzi \(2024\)OLMo: accelerating the science of language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15789–15809\.External Links:[Link](https://aclanthology.org/2024.acl-long.841/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.841)Cited by:[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.Proceedings of the International Conference on Learning Representations \(ICLR\)\.Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark,et al\.\(2022\)Training compute\-optimal large language models\.arXiv preprint arXiv:2203\.15556\.Cited by:[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p2.1),[§3](https://arxiv.org/html/2605.15220#S3.p1.1)\.
- Y\. Jiang, A\. Zhou, Z\. Feng, S\. Malladi, and J\. Z\. Kolter \(2025\)Adaptive data optimization: dynamic sample selection with scaling laws\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=aqok1UX7Z1)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p3.1),[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.2.1.2)\.
- J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska, D\. Hassabis, C\. Clopath, D\. Kumaran, and R\. Hadsell \(2017\)Overcoming catastrophic forgetting in neural networks\.Proceedings of the National Academy of Sciences114\(13\),pp\. 3521–3526\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1611835114),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.1611835114)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- E\. Liu, G\. Neubig, and C\. Xiong \(2026\)Midtraining bridges pretraining and posttraining distributions\.External Links:2510\.14865,[Link](https://arxiv.org/abs/2510.14865)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1)\.
- Q\. Liu, X\. Zheng, N\. Muennighoff, G\. Zeng, L\. Dou, T\. Pang, J\. Jiang, and M\. Lin \(2025\)RegMix: data mixture as regression for language model pre\-training\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=5BjQOUXq7i)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p2.6),[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.7.6.2),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- K\. Lu \(2025\)On\-policy distillation\.Thinking Machines Lab: Connectionism\.Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1),[§4](https://arxiv.org/html/2605.15220#S4.p1.1)\.
- I\. Magnusson, N\. Tai, B\. Bogin, D\. Heineman, J\. D\. Hwang, L\. Soldaini, A\. Bhagia, J\. Liu, D\. Groeneveld, O\. Tafjord, N\. A\. Smith, P\. W\. Koh, and J\. Dodge \(2025\)DataDecide: how to predict best pretraining data with small experiments\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=p9YlQPF8fE)Cited by:[§A\.2](https://arxiv.org/html/2605.15220#A1.SS2.p1.1),[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px2.p1.1)\.
- M\. McCloskey and N\. J\. Cohen \(1989\)Catastrophic interference in connectionist networks: the sequential learning problem\.G\. H\. Bower \(Ed\.\),Psychology of Learning and Motivation, Vol\.24,pp\. 109–165\.External Links:ISSN 0079\-7421,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/S0079-7421%2808%2960536-8),[Link](https://www.sciencedirect.com/science/article/pii/S0079742108605368)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InEMNLP,Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- T\. OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan, N\. Lambert, D\. Schwenk, O\. Tafjord, T\. Anderson, D\. Atkinson, F\. Brahman, C\. Clark, P\. Dasigi, N\. Dziri, A\. Ettinger, M\. Guerquin, D\. Heineman, H\. Ivison, P\. W\. Koh, J\. Liu, S\. Malik, W\. Merrill, L\. J\. V\. Miranda, J\. Morrison, T\. Murray, C\. Nam, J\. Poznanski, V\. Pyatkin, A\. Rangapur, M\. Schmitz, S\. Skjonsberg, D\. Wadden, C\. Wilhelm, M\. Wilson, L\. Zettlemoyer, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2025\)2 olmo 2 furious\.External Links:2501\.00656,[Link](https://arxiv.org/abs/2501.00656)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1),[§4](https://arxiv.org/html/2605.15220#S4.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Gray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=TG8KACxEON)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px3.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of Machine Learning Research21\(140\),pp\. 1–67\.External Links:[Link](http://jmlr.org/papers/v21/20-074.html)Cited by:[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px2.p1.1)\.
- A\. Razdaibiedina, Y\. Mao, R\. Hou, M\. Khabsa, M\. Lewis, and A\. Almahairi \(2023\)Progressive prompts: continual learning for language models\.InThe Eleventh International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=UJTgQBc91_)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p2.1)\.
- D\. Rolnick, A\. Ahuja, J\. Schwarz, T\. Lillicrap, and G\. Wayne \(2019\)Experience replay for continual learning\.InAdvances in Neural Information Processing Systems,H\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d'Alché\-Buc, E\. Fox, and R\. Garnett \(Eds\.\),Vol\.32\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2019/file/fa7cdfad1a5aaf8370ebeda47a1ff1c3-Paper.pdf)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- A\. A\. Rusu, N\. C\. Rabinowitz, G\. Desjardins, H\. Soyer, J\. Kirkpatrick, K\. Kavukcuoglu, R\. Pascanu, and R\. Hadsell \(2022\)Progressive neural networks\.External Links:1606\.04671,[Link](https://arxiv.org/abs/1606.04671)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2019\)WinoGrande: an adversarial winograd schema challenge at scale\.arXiv preprint arXiv:1907\.10641\.Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- J\. Schulman \(2025\)LoRA without regret\.Note:[https://thinkingmachines\.ai/blog/lora/](https://thinkingmachines.ai/blog/lora/)Thinking Machines Lab blog post\. In collaboration with others at Thinking MachinesCited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px3.p1.1)\.
- I\. Shenfeld, M\. Damani, J\. Hübotter, and P\. Agrawal \(2026\)Self\-distillation enables continual learning\.External Links:2601\.19897,[Link](https://arxiv.org/abs/2601.19897)Cited by:[§A\.4](https://arxiv.org/html/2605.15220#A1.SS4.p1.2),[§A\.4](https://arxiv.org/html/2605.15220#A1.SS4.p2.1),[item 3](https://arxiv.org/html/2605.15220#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.15220#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2605.15220#S4.p1.1)\.
- H\. Shi, Z\. Xu, H\. Wang, W\. Qin, W\. Wang, Y\. Wang, Z\. Wang, S\. Ebrahimi, and H\. Wang \(2025\)Continual learning of large language models: a comprehensive survey\.ACM Comput\. Surv\.58\(5\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3735633),[Document](https://dx.doi.org/10.1145/3735633)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p2.1)\.
- H\. Shin, J\. K\. Lee, J\. Kim, and J\. Kim \(2017\)Continual learning with deep generative replay\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 2994–3003\.External Links:ISBN 9781510860964Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\)CommonsenseQA: a question answering challenge targeting commonsense knowledge\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),J\. Burstein, C\. Doran, and T\. Solorio \(Eds\.\),Minneapolis, Minnesota,pp\. 4149–4158\.External Links:[Link](https://aclanthology.org/N19-1421/),[Document](https://dx.doi.org/10.18653/v1/N19-1421)Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- Z\. S\. Tao, K\. Vinken, H\. Yeh, A\. Cooper, and X\. Boix \(2025\)Merge to mix: mixing datasets via model merging\.External Links:2505\.16066,[Link](https://arxiv.org/abs/2505.16066)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p2.1),[§3](https://arxiv.org/html/2605.15220#S3.p2.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. 
Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, D\. Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot \(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p3.1)\.
- J\. Wang, C\. Tian, K\. Chen, Z\. Liu, J\. Mao, W\. X\. Zhao, Z\. Zhang, and J\. Zhou \(2026\)MergeMix: optimizing mid\-training data mixtures via learnable model merging\.External Links:2601\.17858,[Link](https://arxiv.org/abs/2601.17858)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p2.1),[Table 1](https://arxiv.org/html/2605.15220#S2.T1.2.8.7.2),[§3](https://arxiv.org/html/2605.15220#S3.p2.1),[§4](https://arxiv.org/html/2605.15220#S4.SS0.SSS0.Px1.p1.1)\.
- X\. Wang, T\. Chen, Q\. Ge, H\. Xia, R\. Bao, R\. Zheng, Q\. Zhang, T\. Gui, and X\. Huang \(2023\)Orthogonal subspace learning for language model continual learning\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 10658–10671\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.715/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.715)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p1.1)\.
- M\. Weber, D\. Fu, Q\. Anthony, Y\. Oren, S\. Adams, A\. Alexandrov, X\. Lyu, H\. Nguyen, X\. Yao, V\. Adams, B\. Athiwaratkun, R\. Chalamala, K\. Chen, M\. Ryabinin, T\. Dao, P\. Liang, C\. Ré, I\. Rish, and C\. Zhang \(2024\)RedPajama: an open dataset for training large language models\.External Links:2411\.12372,[Link](https://arxiv.org/abs/2411.12372)Cited by:[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Wei, M\. Bosma, V\. Zhao, K\. Guu, A\. W\. Yu, B\. Lester, N\. Du, A\. M\. Dai, and Q\. V\. Le \(2022\)Finetuned language models are zero\-shot learners\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=gEZrGCozdqR)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1)\.
- K\. Wen, Z\. Li, J\. S\. Wang, D\. L\. W\. Hall, P\. Liang, and T\. Ma \(2025\)Understanding warmup\-stable\-decay learning rates: a river valley loss landscape view\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=m51BgoqvbP)Cited by:[Figure 3](https://arxiv.org/html/2605.15220#S2.F3),[Figure 3](https://arxiv.org/html/2605.15220#S2.F3.5.2.1.1),[§4](https://arxiv.org/html/2605.15220#S4.SS0.SSS0.Px2.p1.5)\.
- M\. Wortsman, G\. Ilharco, S\. Y\. Gadre, R\. Roelofs, R\. Gontijo\-Lopes, A\. S\. Morcos, H\. Namkoong, A\. Farhadi, Y\. Carmon, S\. Kornblith, and L\. Schmidt \(2022\)Model soups: averaging weights of multiple fine\-tuned models improves accuracy without increasing inference time\.InProceedings of the 39th International Conference on Machine Learning,K\. Chaudhuri, S\. Jegelka, L\. Song, C\. Szepesvari, G\. Niu, and S\. Sabato \(Eds\.\),Proceedings of Machine Learning Research, Vol\.162,pp\. 23965–23998\.External Links:[Link](https://proceedings.mlr.press/v162/wortsman22a.html)Cited by:[§5\.2](https://arxiv.org/html/2605.15220#S5.SS2.SSS0.Px1.p4.1)\.
- S\. M\. Xie, H\. Pham, X\. Dong, N\. Du, H\. Liu, Y\. Lu, P\. Liang, Q\. V\. Le, T\. Ma, and A\. W\. Yu \(2023\)DoReMi: optimizing data mixtures speeds up language model pretraining\.InThirty\-seventh Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=lXuByUeHhd)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p3.1)\.
- A\. Yang, B\. Yang, B\. Hui, B\. Zheng, B\. Yu, C\. Zhou, C\. Li, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Tang, J\. Wang, J\. Yang, J\. Tu, J\. Zhang, J\. Ma, J\. Yang, J\. Xu, J\. Zhou, J\. Bai, J\. He, J\. Lin, K\. Dang, K\. Lu, K\. Chen, K\. Yang, M\. Li, M\. Xue, N\. Ni, P\. Zhang, P\. Wang, R\. Peng, R\. Men, R\. Gao, R\. Lin, S\. Wang, S\. Bai, S\. Tan, T\. Zhu, T\. Li, T\. Liu, W\. Ge, X\. Deng, X\. Zhou, X\. Ren, X\. Zhang, X\. Wei, X\. Ren, X\. Liu, Y\. Fan, Y\. Yao, Y\. Zhang, Y\. Wan, Y\. Chu, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Guo, and Z\. Fan \(2024\)Qwen2 technical report\.External Links:2407\.10671,[Link](https://arxiv.org/abs/2407.10671)Cited by:[§A\.4](https://arxiv.org/html/2605.15220#A1.SS4.p1.2),[§4\.1](https://arxiv.org/html/2605.15220#S4.SS1.SSS0.Px3.p1.1)\.
- J\. Ye, P\. Liu, T\. Sun, J\. Zhan, Y\. Zhou, and X\. Qiu \(2025\)Data mixing laws: optimizing data mixtures by predicting language modeling performance\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=jjCB27TMK3)Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p1.1),[§2](https://arxiv.org/html/2605.15220#S2.SS0.SSS0.Px2.p2.6),[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px2.p1.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Cited by:[Table 2](https://arxiv.org/html/2605.15220#A0.T2)\.
- S\. Zhao, Z\. Xie, M\. Liu, J\. Huang, G\. Pang, F\. Chen, and A\. Grover \(2026\)Self\-distilled reasoner: on\-policy self\-distillation for large language models\.arXiv preprint arXiv:2601\.18734\.Cited by:[§1](https://arxiv.org/html/2605.15220#S1.p3.1),[§4](https://arxiv.org/html/2605.15220#S4.p1.1)\.
- J\. Zheng, X\. Cai, S\. Qiu, and Q\. Ma \(2025\)Spurious forgetting in continual learning of language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ScI7IlKGdI)Cited by:[§6](https://arxiv.org/html/2605.15220#S6.SS0.SSS0.Px1.p2.1)\.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/opm_vs_sweep_150M.png)

Figure 8: OP\-Mix versus grid sweep\. In the continual midtraining setting, OP\-Mix consistently achieves regret of 1\.18% or less with respect to the optimal value \(estimated by grid sweeping over mixtures\)\. Regret does not grow as more datasets are introduced, unlike with a fixed 10% old data mixture, where regret does grow\.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/clm_suboptimality.png)

Figure 9: OP\-Mix versus retraining\. In the continual midtraining setting, OP\-Mix nearly matches the performance of retraining, indicating that it successfully mitigates catastrophic forgetting on previously seen datasets\.

![Refer to caption](https://arxiv.org/html/2605.15220v1/figures/clm_lora_merge_150M.png)

Figure 10: LoRA merging only is not sufficient\. Simply merging in trained LoRA adapters \(grey\) without finetuning underperforms OP\-Mix \(purple\)\.

Table 2: Downstream zero\-shot accuracy across ARC\-Easy, ARC\-Challenge \[Clark et al\., [2018](https://arxiv.org/html/2605.15220#bib.bib13)\], BoolQ \[Clark et al\., [2019](https://arxiv.org/html/2605.15220#bib.bib14)\], CommonsenseQA \[Talmor et al\., [2019](https://arxiv.org/html/2605.15220#bib.bib19)\], HellaSwag \[Zellers et al\., [2019](https://arxiv.org/html/2605.15220#bib.bib15)\], OpenBookQA \[Mihaylov et al\., [2018](https://arxiv.org/html/2605.15220#bib.bib16)\], PIQA \[Bisk et al\., [2020](https://arxiv.org/html/2605.15220#bib.bib17)\], WinoGrande \[Sakaguchi et al\., [2019](https://arxiv.org/html/2605.15220#bib.bib18)\], and MMLU \[Hendrycks et al\., [2021](https://arxiv.org/html/2605.15220#bib.bib7)\], alongside their unweighted average\. We ran evaluations using lm\-eval\-harness \[Gao et al\., [2024](https://arxiv.org/html/2605.15220#bib.bib4)\]\. Each block reports results for one model size\. Bold marks the best result per column within each model size; italics mark the second best average score\.

## Appendix A Reproducibility

### A\.1 Shared OP\-Mix configuration

Across all three settings, OP\-Mix uses the same high\-level structure \(Algorithm [1](https://arxiv.org/html/2605.15220#alg1)\) and the same regression and solver\. Shared choices are listed in Table [3](https://arxiv.org/html/2605.15220#A1.T3)\.

Table 3: Hyperparameters shared by OP\-Mix across pretraining, continual midtraining, and continual instruction tuning\.

The proxy construction differs slightly between settings\. For pretraining, we Dirichlet\-sample $P$ proxy mix vectors over $\{D_{\text{old}}, D_{m+1}, \dots, D_{m+K}\}$\. For continual midtraining and continual instruction tuning, where $K=1$ new domain arrives per stage, we replace Dirichlet sampling with a deterministic 9\-point grid over the old/new axis at $\alpha_{\text{new}} \in \{0.1, 0.2, \dots, 0.9\}$\. In both continual settings the old\- and new\-domain LoRAs are trained with a *10/90* split \(old probe mixes 10% of the new domain into the old mix; new probe mixes 10% of the old mix into the new domain\) rather than one\-hot specialization; we found this to prevent overestimation of forgetting while still being mathematically correct\.
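The two proxy constructions described above can be sketched as follows\. This is a minimal illustration, not the authors' code: the function names, the symmetric Dirichlet concentration of 1\.0, and the seed are assumptions, since the text does not state them\.

```python
import numpy as np

def dirichlet_proxy_mixes(num_sources, num_proxies, concentration=1.0, seed=0):
    """Sample proxy mix vectors on the simplex over (old, new_1, ..., new_K).

    The concentration value is an assumption; the text only says "Dirichlet-sample".
    """
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.full(num_sources, concentration), size=num_proxies)

def old_new_grid():
    """Deterministic 9-point grid for the K = 1 continual settings."""
    alpha_new = np.arange(1, 10) / 10.0                      # 0.1, 0.2, ..., 0.9
    return np.stack([1.0 - alpha_new, alpha_new], axis=1)   # columns: (old, new)

mixes = dirichlet_proxy_mixes(num_sources=5, num_proxies=20)  # pretraining-style sampling
grid = old_new_grid()                                         # continual-style grid
```

Each row of `mixes` and `grid` is one candidate mixture whose merged proxy would then be evaluated\.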

### A\.2 Pretraining

We pretrain from configuration\-initialized OLMo models at three sizes on a five\-domain mix of Algebraic Stack, ArXiv, c4, Reddit, and StackExchange, all tokenized with the DataDecide Dolma v1\.5 tokenizer Magnusson et al\. \[[2025](https://arxiv.org/html/2605.15220#bib.bib55)\]\. The model ladder uses the allenai/DataDecide\-c4\-\{150M, 300M, 530M\} architectures, initialized from config only \(random weights\)\. Per\-size hyperparameters are in Table [4](https://arxiv.org/html/2605.15220#A1.T4)\.

Table 4: Pretraining hyperparameters for OP\-Mix and the ERM baseline\. MergeMix uses the same prefix, proxy count, and probe length, but trains full\-parameter proxies instead of LoRAs\. OLMix uses a separate 20M\-parameter proxy model \(Table [5](https://arxiv.org/html/2605.15220#A1.T5)\)\.

The ERM prefix is trained on the uniform mix $(1/5, 1/5, 1/5, 1/5, 1/5)$\. After the prefix, each of the 5 LoRA probes is trained on a *90%\-on\-its\-domain / 10%\-on\-the\-old\-mix* partition so the span of the 5 adapters covers the full simplex interior\. We then build $P=20$ Dirichlet\-sampled interpolation merges, evaluate each on the held\-out shards of all 5 domains, fit the log\-linear regression, and train for the remaining $0.8R$ steps on the fitted mix\.
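The regression step can be illustrated with a small sketch\. The exact functional form of the log\-linear regression is not given in this appendix, so the form used here, $\log \ell_j(\boldsymbol{\alpha}) \approx \boldsymbol{\alpha}^{\top} w_j$ for each held\-out domain $j$, is an assumption, as are the function names and the synthetic data standing in for measured proxy losses\.

```python
import numpy as np

def fit_log_linear(mixes, losses):
    """Least-squares fit of log(loss_j) ~= mixes @ W[:, j].

    No separate intercept is used: the mix weights sum to one, so a constant
    is already in the column span of `mixes`.
    """
    W, *_ = np.linalg.lstsq(mixes, np.log(losses), rcond=None)
    return W

def predict_losses(W, mix):
    """Predicted per-domain losses for one candidate mix."""
    return np.exp(mix @ W)

rng = np.random.default_rng(0)
mixes = rng.dirichlet(np.ones(5), size=20)     # P = 20 proxy mixes over 5 domains
W_true = rng.normal(scale=0.3, size=(5, 5))    # synthetic ground truth (illustrative only)
losses = np.exp(mixes @ W_true)                # stand-in for held-out proxy losses
W_fit = fit_log_linear(mixes, losses)
```

On noise\-free synthetic data the fit recovers `W_true`; with real proxy evaluations the fitted surface is only an approximation\.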

Table 5: Proxy configuration for the OLMix baseline in pretraining\.

### A\.3 Continual midtraining

We start from the pretrained DataDecide\-c4 checkpoints from pretraining and continually finetune on the same five domains as pretraining, introduced one stage at a time\. To control for ordering effects we run all five cyclic permutations ord0 through ord4 \(Table [6](https://arxiv.org/html/2605.15220#A1.T6)\)\.
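As a small illustration, the five cyclic permutations can be generated by rotating a base order\. The base order below follows the listing in A\.2 and the lowercase identifiers are placeholders; the actual ord0 is the one given in Table 6, so both are assumptions\.

```python
def cyclic_orderings(domains):
    """Return every rotation of `domains`; rotation i starts at domains[i]."""
    return [domains[i:] + domains[:i] for i in range(len(domains))]

# Hypothetical base order (the paper's ord0 is defined in Table 6).
domains = ["algebraic_stack", "arxiv", "c4", "reddit", "stackexchange"]
orderings = {f"ord{i}": order for i, order in enumerate(cyclic_orderings(domains))}
```

With cyclic rotations, every domain appears exactly once at each stage position across the five orderings\.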

Table 6: Cyclic orderings of the five midtraining domains\. Results are averaged across these five orderings\.

Per\-stage hyperparameters are in Table [7](https://arxiv.org/html/2605.15220#A1.T7)\. Every stage consists of two LoRA probes \(old and new, with the 10/90 split above\), a 9\-point 1\-D proxy scan, a cvxpy mix fit on the reduced $(\alpha_{\text{old}}, \alpha_{\text{new}})$ simplex, expansion of $\alpha_{\text{old}}^{\star}$ onto the previous stage’s mix, and a full\-model finetune on the expanded mix\.

Table 7: Per\-stage hyperparameters for continual midtraining\. Stage 1 is a single\-domain finetune; stages 2–5 run the full OP\-Mix pipeline on top of the previous stage’s checkpoint\.

Baselines inherit the same $R$, batch size, sequence length, learning rate, and warmup schedule\. The “10% data replay” baseline fixes $\alpha_{\text{old}} = 0.1$ at every stage and expands onto the previous mix using the same expansion map $E$ as OP\-Mix\. “Retrain” trains from the original base model for $k \cdot R$ steps on the uniform mix over the $k$ domains seen so far\.
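To make the expansion step concrete, the sketch below assumes that $E$ rescales the previous stage's mix by $\alpha_{\text{old}}$ and appends the new\-domain weights\. That definition is inferred from the phrase "expansion onto the previous stage's mix"; the authoritative definition is in Algorithm 1\. The loop then applies the fixed 10% replay rule over stages 2 to 5\.

```python
import numpy as np

def expand(alpha, prev_mix):
    """Assumed expansion map: E(alpha) = (alpha_old * prev_mix, alpha_new_1, ...)."""
    alpha = np.asarray(alpha, dtype=float)
    prev_mix = np.asarray(prev_mix, dtype=float)
    return np.concatenate([alpha[0] * prev_mix, alpha[1:]])

mix = np.array([1.0])                 # stage 1: a single domain
for stage in range(2, 6):             # stages 2..5 with alpha_old fixed at 0.1
    mix = expand([0.1, 0.9], mix)
```

Under this assumed map, the fixed rule leaves the first domain with weight $0.1^4 = 10^{-4}$ by stage 5, which shows how quickly early domains are down\-weighted when $\alpha_{\text{old}}$ is not re\-optimized\.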

### A\.4 Continual instruction tuning

We use Qwen2\.5\-7B\-Instruct \[Yang et al\., [2024](https://arxiv.org/html/2605.15220#bib.bib30)\] as the base, and reuse the three domains and ordering of Shenfeld et al\. \[[2026](https://arxiv.org/html/2605.15220#bib.bib57)\]: Tool Use \(4,046 examples\) $\to$ Science \(1,233 examples\) $\to$ Medical \(10,000 examples\)\. Each stage is one epoch over its dataset \(capped at 10,000 examples\)\. We evaluate with the SDFT accuracy metric of Shenfeld et al\. \[[2026](https://arxiv.org/html/2605.15220#bib.bib57)\], averaged across the domains seen so far\.

Table 8: Hyperparameters for the Qwen2\.5\-7B\-Instruct continual instruction tuning experiments\. The same settings are used for both the SFT and SDFT variants; SFT simply disables the SDFT\-specific options\. The LoRA probe is trained for a fixed 256 optimizer steps \(short relative to the $\sim 2{,}500$\-step midtraining LoRA\) because the instruction datasets here are small\.

The difference between SFT and SDFT \[Shenfeld et al\., [2026](https://arxiv.org/html/2605.15220#bib.bib57)\] is the training objective: SFT uses cross\-entropy against the dataset targets, while SDFT replaces the targets with reverse KL\-divergence to a teacher model’s distribution\. In the case of SDFT, the teacher is a moving average variant of the student that also receives the correct answer\. In any case, OP\-Mix is applied identically on top of either objective: it only chooses the data\-mix weights fed into the training loop, so “SFT \+ OP\-Mix” and “SDFT \+ OP\-Mix” use exactly the proxy, regression, and solver pipeline of Table [3](https://arxiv.org/html/2605.15220#A1.T3), just with the respective loss\.
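The contrast between the two objectives can be shown on toy per\-token logits\. This is a simplified sketch, not the SDFT implementation: it ignores how sequences are sampled and how the moving\-average teacher is built, and all names and array shapes are illustrative assumptions\.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sft_loss(student_logits, target_ids):
    """Cross-entropy against dataset targets, averaged over token positions."""
    logp = log_softmax(student_logits)
    return -logp[np.arange(len(target_ids)), target_ids].mean()

def reverse_kl_loss(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), averaged over token positions."""
    logp_s = log_softmax(student_logits)
    logp_t = log_softmax(teacher_logits)
    return (np.exp(logp_s) * (logp_s - logp_t)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
student = rng.normal(size=(3, 5))      # 3 token positions, vocabulary of 5
teacher = rng.normal(size=(3, 5))
targets = np.array([0, 2, 4])
ce = sft_loss(student, targets)
rkl = reverse_kl_loss(student, teacher)
```

Either loss can be plugged into the same training loop; the mixture weights chosen by OP\-Mix only affect which examples that loop sees\.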

#### Compute\.

All experiments run on a cluster with a mix of A100, H100, L40S, and H200 GPUs\. A nice property of LoRA is that it allows on\-policy proxies to run on heterogeneous compute: for example, we can run pretraining on an H200 but run proxies on an L40S, which has less than a third of the GPU VRAM\.

#### Seeds and variance\.

Each continual midtraining configuration is run with a single seed per \{model size, ordering\} cell; variance in the continual setting is instead quantified across the five cyclic orderings\. Pretraining and continual instruction tuning results are both averaged across seeds $s \in \{42, 43, 44\}$\.

## Appendix B Theory

Consider one continual step of Algorithm [1](https://arxiv.org/html/2605.15220#alg1)\. The current stage has previous domains $\{D_1, \dots, D_m\}$, over which we played the mixture $p_{t-1} \in \triangle^{m-1}$, and receives new domains $\{D_{m+1}, \dots, D_{m+K}\}$\. Let

$$\boldsymbol{\alpha} = (\alpha_{\text{old}}, \alpha_{m+1}, \dots, \alpha_{m+K}) \in \triangle^{K}$$

denote the reduced\-simplex weights used by OP\-Mix, and let $E: \triangle^{K} \to \triangle^{m+K-1}$ be the mixture expansion map from Algorithm [1](https://arxiv.org/html/2605.15220#alg1)\. We write $\theta_{\text{base}}$ for the current base model, and $\theta^{\text{train}}(\boldsymbol{\alpha})$ for the model obtained by continuing training from $\theta_{\text{base}}$ on the expanded mixture $E(\boldsymbol{\alpha})$\.

For the proxy construction, let $\theta^{\text{full}}_{\text{old}}$ be the result of full finetuning on the old\-data mixture $p_{t-1}$, and let $\theta^{\text{full}}_{D_{m+k}}$ be the result of full finetuning on $D_{m+k}$, all starting from $\theta_{\text{base}}$\. Likewise, let $\theta^{\text{LoRA}}_{\text{old}}$ and $\theta^{\text{LoRA}}_{D_{m+k}}$ be the corresponding LoRA\-adapted models produced by Algorithm [1](https://arxiv.org/html/2605.15220#alg1), with the adapters applied to $\theta_{\text{base}}$\. We then define the merged full\-model proxy and merged LoRA proxy:

$$\theta^{M}(\boldsymbol{\alpha}) := \alpha_{\text{old}}\,\theta^{\text{full}}_{\text{old}} + \sum_{k=1}^{K} \alpha_{m+k}\,\theta^{\text{full}}_{D_{m+k}}, \tag{2}$$

$$\hat{\theta}(\boldsymbol{\alpha}) := \alpha_{\text{old}}\,\theta^{\text{LoRA}}_{\text{old}} + \sum_{k=1}^{K} \alpha_{m+k}\,\theta^{\text{LoRA}}_{D_{m+k}}. \tag{3}$$

Because the coefficients in $\boldsymbol{\alpha}$ are nonnegative and sum to one, and every model above is trained from the same $\theta_{\text{base}}$, each expression equals $\theta_{\text{base}}$ plus a convex combination of the corresponding parameter updates\.
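The merged LoRA proxy of Eq\. \(3\) can be sketched for a single weight matrix\. The sketch assumes each adapter contributes a dense update $B_i A_i$ to $\theta_{\text{base}}$; the shapes, rank, and random values are illustrative assumptions rather than the paper's configuration\.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2
theta_base = rng.normal(size=(d_out, d_in))

# One (B, A) pair per probe: index 0 is the old-mix probe, the rest are new domains.
adapters = [(rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in)))
            for _ in range(3)]
alpha = np.array([0.5, 0.3, 0.2])      # a point on the reduced simplex

def merged_lora_proxy(theta_base, adapters, alpha):
    """theta_base plus the alpha-weighted sum of dense LoRA updates B @ A."""
    delta = sum(a * (B @ A) for a, (B, A) in zip(alpha, adapters))
    return theta_base + delta

theta_hat = merged_lora_proxy(theta_base, adapters, alpha)
# Eq. (3) written literally: a weighted sum of the LoRA-adapted models.
direct = sum(a * (theta_base + B @ A) for a, (B, A) in zip(alpha, adapters))
```

The two forms agree because the weights sum to one, so each candidate mixture costs only a weighted sum of already\-trained adapters\.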

The three loss surfaces are therefore

$$F(\boldsymbol{\alpha}) := \frac{1}{N} \sum_{j=1}^{N} f_j\!\left(\theta^{\text{train}}(\boldsymbol{\alpha})\right), \tag{4}$$

$$F^{M}(\boldsymbol{\alpha}) := \frac{1}{N} \sum_{j=1}^{N} f_j\!\left(\theta^{M}(\boldsymbol{\alpha})\right), \tag{5}$$

$$\hat{F}(\boldsymbol{\alpha}) := \frac{1}{N} \sum_{j=1}^{N} f_j\!\left(\hat{\theta}(\boldsymbol{\alpha})\right). \tag{6}$$

The regularized objectives are

$$J(\boldsymbol{\alpha}) := F(\boldsymbol{\alpha}) + \lambda\, D_{\text{KL}}\!\bigl(E(\boldsymbol{\alpha}) \,\big\|\, \mu\bigr), \tag{7}$$

$$\hat{J}(\boldsymbol{\alpha}) := \hat{F}(\boldsymbol{\alpha}) + \lambda\, D_{\text{KL}}\!\bigl(E(\boldsymbol{\alpha}) \,\big\|\, \mu\bigr), \tag{8}$$

with optimizers

$$\boldsymbol{\alpha}^{\star} := \arg\min_{\boldsymbol{\alpha} \in \triangle^{K}} J(\boldsymbol{\alpha}), \tag{9}$$

$$\hat{\boldsymbol{\alpha}} := \arg\min_{\boldsymbol{\alpha} \in \triangle^{K}} \hat{J}(\boldsymbol{\alpha}). \tag{10}$$

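For intuition, the proxy objective of Eq\. \(8\) can be minimized by brute force when $K=1$\. This sketch is not the paper's solver: it swaps the cvxpy fit for a dense grid search, uses a toy quadratic in place of the fitted surface $\hat{F}$, takes the reference mixture $\mu$ to be uniform, and assumes $E$ rescales the previous mix by $\alpha_{\text{old}}$\. All of these are assumptions made for illustration\.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions, skipping zero entries of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def expand(alpha, prev_mix):
    """Assumed expansion map: (alpha_old * prev_mix, alpha_new)."""
    return np.concatenate([alpha[0] * np.asarray(prev_mix, dtype=float), alpha[1:]])

def minimize_proxy_objective(f_hat, prev_mix, mu, lam, num_points=999):
    """Grid search for argmin of f_hat(alpha) + lam * KL(E(alpha) || mu)."""
    best_alpha, best_value = None, np.inf
    for a_new in np.linspace(0.001, 0.999, num_points):
        alpha = np.array([1.0 - a_new, a_new])
        value = f_hat(alpha) + lam * kl(expand(alpha, prev_mix), mu)
        if value < best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, best_value

f_hat = lambda alpha: (alpha[1] - 0.7) ** 2 + 2.0   # toy surrogate, minimized at 0.7
prev_mix = np.array([0.5, 0.5])                     # previous stage's mix over 2 domains
mu = np.full(3, 1.0 / 3.0)                          # assumed uniform reference mixture
alpha_unreg, _ = minimize_proxy_objective(f_hat, prev_mix, mu, lam=0.0)
alpha_reg, _ = minimize_proxy_objective(f_hat, prev_mix, mu, lam=1.0)
```

With $\lambda = 0$ the search returns the surrogate's own minimizer; with $\lambda = 1$ the KL term pulls the new\-domain weight toward the value that makes the expanded mix uniform\.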
The two approximation errors are

$$\varepsilon_{\text{merge}} := \sup_{\boldsymbol{\alpha} \in \triangle^{K}} \left| F(\boldsymbol{\alpha}) - F^{M}(\boldsymbol{\alpha}) \right|, \tag{11}$$

$$\varepsilon_{\text{LoRA}} := \sup_{\boldsymbol{\alpha} \in \triangle^{K}} \left| F^{M}(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha}) \right|. \tag{12}$$

###### Assumption 1 \(Idealized proxy optimization\)\.

We assume: \(i\) the fitted regression surface used in Algorithm [1](https://arxiv.org/html/2605.15220#alg1) recovers the merged\-LoRA loss surface exactly, so the optimization step is equivalent to minimizing $\hat{J}$ over $\triangle^{K}$; and \(ii\) both $J$ and $\hat{J}$ are minimized exactly\. This isolates the structural approximation errors induced by LoRA and model merging from finite\-sample regression error, numerical optimization error, and proxy\-training budget mismatch\.

### B\.1 Exact Recovery

###### Proposition 1 \(Exact recovery\)\.

Under Assumption [1](https://arxiv.org/html/2605.15220#Thmassumption1), if $\varepsilon_{\mathrm{merge}} = \varepsilon_{\mathrm{LoRA}} = 0$, then $\hat{F} = F$ on $\triangle^{K}$ and every minimizer of $\hat{J}$ is also a minimizer of $J$\. In particular,

$$\hat{\boldsymbol{\alpha}} \in \arg\min_{\boldsymbol{\alpha} \in \triangle^{K}} J(\boldsymbol{\alpha}).$$

###### Proof\.

If $\varepsilon_{\mathrm{merge}} = 0$, then $F(\boldsymbol{\alpha}) = F^{M}(\boldsymbol{\alpha})$ for all $\boldsymbol{\alpha} \in \triangle^{K}$, so the merged full\-model proxy matches the loss of training on the expanded mixture\. If additionally $\varepsilon_{\mathrm{LoRA}} = 0$, then $F^{M}(\boldsymbol{\alpha}) = \hat{F}(\boldsymbol{\alpha})$ for all $\boldsymbol{\alpha}$, so the merged LoRA proxy is also exact\. Hence $\hat{F}(\boldsymbol{\alpha}) = F(\boldsymbol{\alpha})$ on $\triangle^{K}$, which implies $\hat{J}(\boldsymbol{\alpha}) = J(\boldsymbol{\alpha})$ because both objectives share the same regularizer\. Therefore $\arg\min \hat{J} = \arg\min J$, and in particular any optimizer $\hat{\boldsymbol{\alpha}}$ of the proxy objective is also an optimizer of the true objective\. ∎

This is the analog of Lemma 2 of Chen et al\. \[[2026](https://arxiv.org/html/2605.15220#bib.bib1)\], which shows exact recovery when the reused mixture is itself optimal\. For OP\-Mix, the corresponding ideal condition is that the reduced\-simplex proxy surface exactly matches the true objective after expansion by $E$\.

### B\.2 Performance Gap Bound

We first establish a uniform approximation lemma, then use it to prove Theorem [Remark](https://arxiv.org/html/2605.15220#Thmremarkx1)\.

###### Lemma B\.1 \(Uniform proxy error\)\.

For any $\boldsymbol{\alpha} \in \triangle^{K}$:

$$\left| F(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha}) \right| \;\leq\; \varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}. \tag{13}$$

###### Proof\.

By the triangle inequality, introducing the intermediate surface $F^{M}$:

$$\begin{aligned}
\left| F(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha}) \right|
&= \left| F(\boldsymbol{\alpha}) - F^{M}(\boldsymbol{\alpha}) + F^{M}(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha}) \right| \\
&\leq \left| F(\boldsymbol{\alpha}) - F^{M}(\boldsymbol{\alpha}) \right| + \left| F^{M}(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha}) \right| \\
&\leq \sup_{\boldsymbol{\alpha}' \in \triangle^{K}} \left| F(\boldsymbol{\alpha}') - F^{M}(\boldsymbol{\alpha}') \right| + \sup_{\boldsymbol{\alpha}' \in \triangle^{K}} \left| F^{M}(\boldsymbol{\alpha}') - \hat{F}(\boldsymbol{\alpha}') \right| \\
&= \varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}.
\end{aligned}$$

∎

###### Proof\.

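We prove the bound $J(\hat{\boldsymbol{\alpha}}) - J(\boldsymbol{\alpha}^{\star}) \leq 2(\varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}})$ by adding and subtracting $\hat{J}$ to decompose the objective gap: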

$$J(\hat{\boldsymbol{\alpha}}) - J(\boldsymbol{\alpha}^{\star}) = \big[ J(\hat{\boldsymbol{\alpha}}) - \hat{J}(\hat{\boldsymbol{\alpha}}) \big] + \big[ \hat{J}(\hat{\boldsymbol{\alpha}}) - \hat{J}(\boldsymbol{\alpha}^{\star}) \big] + \big[ \hat{J}(\boldsymbol{\alpha}^{\star}) - J(\boldsymbol{\alpha}^{\star}) \big]. \tag{14}$$

Middle term\. Since $\hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha} \in \triangle^{K}} \hat{J}(\boldsymbol{\alpha})$ and $\boldsymbol{\alpha}^{\star} \in \triangle^{K}$:

$$\hat{J}(\hat{\boldsymbol{\alpha}}) - \hat{J}(\boldsymbol{\alpha}^{\star}) \;\leq\; 0. \tag{15}$$

First term\. Note that $J(\boldsymbol{\alpha}) - \hat{J}(\boldsymbol{\alpha}) = F(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha})$ for any $\boldsymbol{\alpha}$, since the regularization terms are identical in $J$ and $\hat{J}$\. Applying Lemma [B\.1](https://arxiv.org/html/2605.15220#A2.Thmlemma1):

$$J(\hat{\boldsymbol{\alpha}}) - \hat{J}(\hat{\boldsymbol{\alpha}}) = F(\hat{\boldsymbol{\alpha}}) - \hat{F}(\hat{\boldsymbol{\alpha}}) \;\leq\; \left| F(\hat{\boldsymbol{\alpha}}) - \hat{F}(\hat{\boldsymbol{\alpha}}) \right| \;\leq\; \varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}. \tag{16}$$

Third term\. By the same reasoning:

$$\hat{J}(\boldsymbol{\alpha}^{\star}) - J(\boldsymbol{\alpha}^{\star}) = \hat{F}(\boldsymbol{\alpha}^{\star}) - F(\boldsymbol{\alpha}^{\star}) \;\leq\; \left| \hat{F}(\boldsymbol{\alpha}^{\star}) - F(\boldsymbol{\alpha}^{\star}) \right| \;\leq\; \varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}. \tag{17}$$

Substituting \([15](https://arxiv.org/html/2605.15220#A2.E15)\), \([16](https://arxiv.org/html/2605.15220#A2.E16)\), and \([17](https://arxiv.org/html/2605.15220#A2.E17)\) into \([14](https://arxiv.org/html/2605.15220#A2.E14)\):

$$\begin{aligned}
J(\hat{\boldsymbol{\alpha}}) - J(\boldsymbol{\alpha}^{\star})
&\leq (\varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}) + 0 + (\varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}) \\
&= 2(\varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}}).
\end{aligned}$$

∎
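The $2(\varepsilon_{\mathrm{merge}} + \varepsilon_{\mathrm{LoRA}})$ bound can be sanity\-checked numerically on a toy problem\. Everything below is synthetic: the quadratic surface, the uniform perturbation, and the quadratic stand\-in for the shared regularizer are assumptions chosen only to satisfy the conditions of the proof, with a finite grid playing the role of the $K=1$ simplex\.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 201)                    # alpha_new on the K = 1 simplex
F = (grid - 0.6) ** 2                                 # toy "true" loss surface
eps = 0.05                                            # plays the role of eps_merge + eps_LoRA
F_hat = F + rng.uniform(-eps, eps, size=grid.shape)   # proxy within eps of F everywhere
reg = 0.1 * (grid - 0.5) ** 2                         # any regularizer shared by J and J_hat

J, J_hat = F + reg, F_hat + reg
gap = J[np.argmin(J_hat)] - J.min()                   # J(alpha_hat) - J(alpha_star)
```

The gap is nonnegative by definition of the true minimizer and stays below `2 * eps`, matching the three\-term decomposition above\.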

### B\.3 Characterizing the Approximation Errors

The bound in Theorem [Remark](https://arxiv.org/html/2605.15220#Thmremarkx1) reduces the analysis of OP\-Mix to bounding $\varepsilon_{\mathrm{merge}}$ and $\varepsilon_{\mathrm{LoRA}}$ separately\. We now provide Lipschitz\-based characterizations of each\.

###### Proposition 2 \(LoRA approximation bound\)\.

If each downstream metric $f_j$ is $L_j$\-Lipschitz in model parameters with respect to the Frobenius norm, i\.e\., $|f_j(\theta) - f_j(\theta')| \leq L_j \|\theta - \theta'\|_F$ for all $\theta, \theta'$, then with $L = \frac{1}{N} \sum_{j=1}^{N} L_j$:

$$\varepsilon_{\mathrm{LoRA}} \;\leq\; L \cdot \max\!\Biggl\{ \left\| \theta^{\mathrm{full}}_{\mathrm{old}} - \theta^{\mathrm{LoRA}}_{\mathrm{old}} \right\|_F,\; \max_{k \in [K]} \left\| \theta^{\mathrm{full}}_{D_{m+k}} - \theta^{\mathrm{LoRA}}_{D_{m+k}} \right\|_F \Biggr\}. \tag{18}$$

###### Proof\.

For any $\boldsymbol{\alpha} \in \triangle^{K}$:

$$\begin{aligned}
\left| F^{M}(\boldsymbol{\alpha}) - \hat{F}(\boldsymbol{\alpha}) \right|
&= \left| \frac{1}{N} \sum_{j=1}^{N} \Big[ f_j\!\big(\theta^{M}(\boldsymbol{\alpha})\big) - f_j\!\big(\hat{\theta}(\boldsymbol{\alpha})\big) \Big] \right| \\
&\leq \frac{1}{N} \sum_{j=1}^{N} L_j \left\| \theta^{M}(\boldsymbol{\alpha}) - \hat{\theta}(\boldsymbol{\alpha}) \right\|_F \\
&= L \left\| \alpha_{\mathrm{old}} \Big( \theta^{\mathrm{full}}_{\mathrm{old}} - \theta^{\mathrm{LoRA}}_{\mathrm{old}} \Big) + \sum_{k=1}^{K} \alpha_{m+k} \Big( \theta^{\mathrm{full}}_{D_{m+k}} - \theta^{\mathrm{LoRA}}_{D_{m+k}} \Big) \right\|_F \\
&\leq L \left[ \alpha_{\mathrm{old}} \left\| \theta^{\mathrm{full}}_{\mathrm{old}} - \theta^{\mathrm{LoRA}}_{\mathrm{old}} \right\|_F + \sum_{k=1}^{K} \alpha_{m+k} \left\| \theta^{\mathrm{full}}_{D_{m+k}} - \theta^{\mathrm{LoRA}}_{D_{m+k}} \right\|_F \right] \\
&\leq L \cdot \max\!\Biggl\{ \left\| \theta^{\mathrm{full}}_{\mathrm{old}} - \theta^{\mathrm{LoRA}}_{\mathrm{old}} \right\|_F,\; \max_{k \in [K]} \left\| \theta^{\mathrm{full}}_{D_{m+k}} - \theta^{\mathrm{LoRA}}_{D_{m+k}} \right\|_F \Biggr\}.
\end{aligned} \tag{19}$$

∎
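A quick numerical check of the bound in Eq\. \(18\) is sketched below for a single metric\. The metric $f(\theta) = \|\theta - \theta_{\text{ref}}\|_F$ is chosen only because it is 1\-Lipschitz in the Frobenius norm \(so $L = 1$\); the random matrices standing in for full\-finetuned and LoRA\-adapted models are likewise assumptions, not data from the paper\.

```python
import numpy as np

rng = np.random.default_rng(1)
shape, n_models = (4, 4), 3
theta_full = [rng.normal(size=shape) for _ in range(n_models)]
theta_lora = [t + 0.1 * rng.normal(size=shape) for t in theta_full]  # perturbed stand-ins
theta_ref = rng.normal(size=shape)

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
f = lambda theta: np.linalg.norm(theta - theta_ref)   # 1-Lipschitz, so L = 1

# Right-hand side of the bound: largest per-model gap between full and LoRA weights.
bound = max(np.linalg.norm(a - b) for a, b in zip(theta_full, theta_lora))

worst = 0.0
for alpha in rng.dirichlet(np.ones(n_models), size=1000):
    merged_full = sum(w * t for w, t in zip(alpha, theta_full))
    merged_lora = sum(w * t for w, t in zip(alpha, theta_lora))
    worst = max(worst, abs(f(merged_full) - f(merged_lora)))
```

Across the sampled mixtures, the observed gap `worst` never exceeds `bound`, as the convexity step of the proof predicts\.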

###### Proposition 3 (Merging approximation bound).

Under the same Lipschitz condition as Proposition [2](https://arxiv.org/html/2605.15220#Thmproposition2), let $\theta^{\mathrm{train}}(\boldsymbol{\alpha})$ denote the model trained on the expanded mixture $E(\boldsymbol{\alpha})$, and let $\theta^{M}(\boldsymbol{\alpha})$ denote the merged full-model proxy. Then:

$$
\varepsilon_{\mathrm{merge}} \;\leq\; L \cdot \sup_{\boldsymbol{\alpha}\in\triangle^{K}} \left\|\theta^{\mathrm{train}}(\boldsymbol{\alpha}) - \theta^{M}(\boldsymbol{\alpha})\right\|_{F}. \tag{20}
$$

###### Proof\.

For any $\boldsymbol{\alpha} \in \triangle^{K}$:

$$
\begin{aligned}
\left|F(\boldsymbol{\alpha}) - F^{M}(\boldsymbol{\alpha})\right|
&= \left|\frac{1}{N}\sum_{j=1}^{N}\Big[f_{j}\big(\theta^{\mathrm{train}}(\boldsymbol{\alpha})\big) - f_{j}\big(\theta^{M}(\boldsymbol{\alpha})\big)\Big]\right| \\
&\leq \frac{1}{N}\sum_{j=1}^{N} L_{j}\left\|\theta^{\mathrm{train}}(\boldsymbol{\alpha}) - \theta^{M}(\boldsymbol{\alpha})\right\|_{F} \\
&= L\left\|\theta^{\mathrm{train}}(\boldsymbol{\alpha}) - \theta^{M}(\boldsymbol{\alpha})\right\|_{F}.
\end{aligned}
$$

Taking the supremum over $\boldsymbol{\alpha}$ yields the result. ∎

###### Corollary B.2 (Merging error vanishes under linear mode connectivity).

Define the linearity gap

$$
\delta_{\mathrm{LMC}} := \sup_{\boldsymbol{\alpha}\in\triangle^{K}} \left\|\theta^{\mathrm{train}}(\boldsymbol{\alpha}) - \theta^{M}(\boldsymbol{\alpha})\right\|_{F}.
$$

For any $\varepsilon > 0$, if $\delta_{\mathrm{LMC}} \leq \varepsilon/L$, then $\varepsilon_{\mathrm{merge}} \leq \varepsilon$.

###### Proof\.

$\varepsilon_{\mathrm{merge}} \leq L\cdot\delta_{\mathrm{LMC}} \leq L\cdot\varepsilon/L = \varepsilon$. ∎

The linearity gap $\delta_{\mathrm{LMC}}$ measures how well the convex hull of the old-data model and the new-domain models approximates actual training on the expanded mixture. Because OP-Mix compresses all historical data into the single component $\theta^{\mathrm{full}}_{\mathrm{old}}$, it only needs this reduced simplex to be well behaved, rather than requiring one separately trained model for every historical domain.
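To make the linearity gap concrete, here is a toy sketch under strong assumptions that are not from the paper: closed-form weighted least squares stands in for "training on the expanded mixture", there is one old component and one new component, and the supremum is approximated on a grid of mixture weights. The function names `fit_weighted_least_squares` and `linearity_gap` are hypothetical.

```python
import numpy as np


def fit_weighted_least_squares(datasets, weights):
    """Closed-form minimiser of sum_i w_i * MSE_i (toy stand-in for training)."""
    dim = datasets[0][0].shape[1]
    gram = np.zeros((dim, dim))
    moment = np.zeros(dim)
    for (X, y), w in zip(datasets, weights):
        gram += w * (X.T @ X) / len(y)
        moment += w * (X.T @ y) / len(y)
    return np.linalg.solve(gram, moment)


def linearity_gap(datasets, alphas):
    """Largest ||theta_train(alpha) - theta_M(alpha)||_2 over the given alphas."""
    # Endpoint models: each one is fit on a single component of the mixture.
    endpoints = np.stack(
        [fit_weighted_least_squares([ds], [1.0]) for ds in datasets]
    )
    gaps = []
    for alpha in alphas:
        theta_train = fit_weighted_least_squares(datasets, alpha)
        theta_merged = alpha @ endpoints  # convex combination of endpoints
        gaps.append(np.linalg.norm(theta_train - theta_merged))
    return max(gaps)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y_old = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
y_new = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
alphas = [np.array([a, 1.0 - a]) for a in np.linspace(0.0, 1.0, 11)]

# Same inputs for both components: merging reproduces mixture training.
gap_matched = linearity_gap([(X, y_old), (X, y_new)], alphas)
# Rescaled inputs for the new component: merging no longer matches.
gap_shifted = linearity_gap([(X, y_old), (3.0 * X, y_new)], alphas)
print(f"matched inputs: {gap_matched:.2e}, shifted inputs: {gap_shifted:.2e}")
```

In this toy setting the gap is zero up to floating-point error when both components share the same inputs, and becomes clearly nonzero once the new component's inputs are rescaled. It illustrates only what the quantity measures and says nothing about how large the gap is for actual language models.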

Linear mode connectivity says that moving along the interpolation path between endpoint models does not create a large loss barrier. That observation is exactly why model merging is a plausible proxy in OP-Mix: it suggests that linear interpolation should not cause catastrophic loss blowups.
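As a rough illustration of what a loss barrier along an interpolation path looks like, the sketch below uses a toy double-well loss rather than a language model. It takes the barrier to be the largest excess of the path loss over the linear interpolation of the two endpoint losses; this is one common convention and an assumption here, as are the names `loss_barrier` and `double_well`.

```python
import numpy as np


def loss_barrier(loss, theta_a, theta_b, num_points=101):
    """Max excess of the path loss over the interpolated endpoint losses."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    barrier = -np.inf
    for t in np.linspace(0.0, 1.0, num_points):
        theta_t = (1.0 - t) * theta_a + t * theta_b  # point on the linear path
        baseline = (1.0 - t) * loss_a + t * loss_b   # interpolated endpoint loss
        barrier = max(barrier, loss(theta_t) - baseline)
    return float(barrier)


def double_well(theta):
    """Toy nonconvex loss, minimised when every coordinate is +1 or -1."""
    return float(np.mean((theta ** 2 - 1.0) ** 2))


# Endpoints in the same basin: the path stays in a convex region.
same_basin = loss_barrier(double_well, np.array([0.9, 0.9]), np.array([1.1, 1.1]))
# Endpoints in different basins: the path crosses the ridge at theta = 0.
across_basins = loss_barrier(double_well, np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
print(f"same basin: {same_basin:.3f}, across basins: {across_basins:.3f}")
```

The same-basin pair has essentially no barrier, while the cross-basin pair has a barrier of about one. Merging is a reasonable proxy only in the first regime.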