RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

arXiv cs.CL Papers

Summary

RegMix-D extends RegMix to dynamic data mixing by using loss trajectories from proxy runs to predict optimal mixtures at multiple training stages, achieving improvements over static methods.

arXiv:2606.18663v1 Announce Type: new Abstract: Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses, but also full loss trajectories, which can be used to further improve data mixture. By training regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an offline variant that generates a complete mixture schedule before target training, and an online variant that adapts the mixture during training using observed loss. Experiments on 25B tokens of the Pile dataset with a 1B parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).

Cached at: 06/18/26, 05:45 AM

# RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories
Source: [https://arxiv.org/html/2606.18663](https://arxiv.org/html/2606.18663)
Kaiyan Zhao1,2, Zhongtao Miao1, Akiko Aizawa2, Yoshimasa Tsuruoka1 1The University of Tokyo,2National Institute of Informatics \{kaiyan1006, miao, tsuruoka\}@logos\.t\.u\-tokyo\.ac\.jp, aizawa@nii\.ac\.jp

###### Abstract

Data mixture selection is critical for Large Language Model pretraining. Existing methods such as RegMix select a single static mixture by fitting a regression model on small-scale proxy runs. We propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that proxy runs produce not only endpoint losses but also full loss trajectories, which can be used to further improve the data mixture. By training a regression model on these trajectories, we can predict optimal mixtures at multiple training stages. RegMix-D supports two deployment modes: an *offline* variant that generates a complete mixture schedule before target training, and an *online* variant that adapts the mixture during training using the observed loss. Experiments on 25B tokens of the Pile dataset with a 1B-parameter target model show that RegMix-D consistently improves over RegMix and DoReMi across 13 downstream tasks while remaining proxy-efficient: it surpasses RegMix even with only 128 proxy models (25% of RegMix's proxy compute budget).


## 1 Introduction

Pretraining Large Language Models (LLMs) requires assembling massive corpora from diverse domains (Belenki et al., [2025](https://arxiv.org/html/2606.18663#bib.bib1); Chen et al., [2026](https://arxiv.org/html/2606.18663#bib.bib4)). The mixture proportions among these domains substantially affect downstream performance, making data mixture selection a critical design choice (Feng et al., [2024](https://arxiv.org/html/2606.18663#bib.bib7)). Recent work has proposed automated methods to optimize these proportions, including DoReMi (Xie et al., [2023](https://arxiv.org/html/2606.18663#bib.bib29)), which uses group DRO (Oren et al., [2019](https://arxiv.org/html/2606.18663#bib.bib21)) over a reference model, and RegMix (Liu et al., [2025](https://arxiv.org/html/2606.18663#bib.bib16)), which trains a regression model on a set of small proxy models to predict loss as a function of the mixture.

While effective, these methods share a common assumption: a single optimal mixture exists and can be used throughout pretraining. However, the optimal composition of different domains may shift during training (Mo et al., [2025](https://arxiv.org/html/2606.18663#bib.bib20)). Under this view, static mixtures necessarily compromise across training, leading to suboptimal performance.

Several recent methods adopt dynamic mixing to address this limitation (Chen et al., [2025](https://arxiv.org/html/2606.18663#bib.bib3); Wang et al., [2025](https://arxiv.org/html/2606.18663#bib.bib27); Ma et al., [2025](https://arxiv.org/html/2606.18663#bib.bib18)). However, these methods introduce additional overhead on top of standard training. See Section [2](https://arxiv.org/html/2606.18663#S2) for discussion.

To this end, we propose RegMix-D, a simple extension of RegMix to dynamic mixing. Our key observation is that the loss trajectories collected during proxy runs already contain the information needed for dynamic prediction: each proxy run produces a full loss curve, not just an endpoint. We train regression models that take the current training state $(t, m, \ell)$, where $t$ is the proxy step, $m$ is the current mixture, and $\ell$ is the current loss, to predict the loss at the next sampled step. This model can be deployed in two modes (Figure [1](https://arxiv.org/html/2606.18663#S1.F1)): an *offline* mode that recursively generates mixture schedules before target training begins, and an *online* mode that queries the regression model during training and adapts the mixture in place. Neither mode introduces additional optimization mechanisms.

![Refer to caption](https://arxiv.org/html/2606.18663v1/x1.png)

Figure 1: Overview of RegMix-D. We train a regression model $f$ on proxy loss trajectories (left), then deploy $f$ in two modes (right): *Offline* recursively generates a complete mixture schedule before target training; *Online* queries $f$ during target training using observed losses to adapt the mixture in place.

We evaluate RegMix-D on 25B tokens of the Pile dataset (Gao et al., [2020](https://arxiv.org/html/2606.18663#bib.bib8)) with a 1B-parameter target model, following the standard RegMix setup. Our contributions are as follows:

- We introduce RegMix-D, a dynamic extension of RegMix with offline and online variants, requiring no additional training machinery beyond the existing regression framework.
- Across all tested granularities, RegMix-D consistently outperforms RegMix on validation loss and achieves superior performance across 13 downstream tasks.
- RegMix-D is data-efficient: 128 proxy models (25% of RegMix's compute budget) suffice to surpass RegMix with 512 proxy models.

## 2 Related Work

#### Static Data Mixing\.

Conventional methods select a single mixture for the whole training run. DoReMi (Xie et al., [2023](https://arxiv.org/html/2606.18663#bib.bib29)) uses group DRO with a reference model to upweight domains with high excess loss. DoGE (Fan et al., [2024](https://arxiv.org/html/2606.18663#bib.bib6)) reweights domains based on gradient-level generalization contribution. RegMix (Liu et al., [2025](https://arxiv.org/html/2606.18663#bib.bib16)) trains small proxy models with diverse mixtures and fits a regression model that predicts validation loss from mixture candidates, then selects the optimal mixture by search. An alternative line of work selects mixtures via weighted model merging rather than direct proxy training (Li et al., [2026](https://arxiv.org/html/2606.18663#bib.bib14); Wang et al., [2026](https://arxiv.org/html/2606.18663#bib.bib26)). RegMix-D builds directly on RegMix's regression framework, generalizing single-mixture selection to time-varying schedules.

#### Dynamic Data Mixing

Several recent methods adopt time-varying mixtures. Aioli (Chen et al., [2025](https://arxiv.org/html/2606.18663#bib.bib3)) estimates mixing-law parameters online during target training and updates proportions via exponentiated gradient descent. TiKMiX (Wang et al., [2025](https://arxiv.org/html/2606.18663#bib.bib27)) reweights domains using group gradient influence on a target distribution. AC-ODM (Ma et al., [2025](https://arxiv.org/html/2606.18663#bib.bib18)) trains an actor-critic network on proxy models and transfers it to target training. RegMix-D differs from them along three axes: (i) *No additional machinery during target training*: existing dynamic methods introduce online optimization, gradient-influence computation, or RL queries on top of target training, whereas RegMix-D confines all optimization to the proxy phase. (ii) *Reuses RegMix's signal*: we predict from proxy loss trajectories already produced in the RegMix pipeline, requiring no new instrumentation. (iii) *Two deployment modes*: a single regression model supports both offline and online adaptation.

## 3 RegMix-D

### 3.1 Preliminaries

RegMix (Liu et al., [2025](https://arxiv.org/html/2606.18663#bib.bib16)) selects a single static data mixture by training $M$ small *proxy* models (one per sampled mixture) for a fixed number of steps $T_p$, recording the validation loss at the final step, and fitting a regression model that maps a mixture to its predicted validation loss. The optimal mixture is then identified by searching over candidate mixtures and selecting those with the lowest predicted loss. The *target* model is subsequently trained from scratch with this single mixture for the entire run. This discards two sources of information in proxy runs: (i) intermediate losses along each proxy trajectory, and (ii) how the optimal mixture might evolve across training.

### 3.2 Trajectory-Conditioned Regression

We log validation loss at multiple step intervals during training instead of only the endpoint loss. Given a desired number of switch points $N$ (at which the mixture would change), we select $N+1$ intervals $\{t_1, t_2, \ldots, t_{N+1}\}$ from the proxy run. For each proxy run with mixture $m^{(i)}$, this yields a trajectory $\{\ell_1^{(i)}, \ldots, \ell_{N+1}^{(i)}\}$ of validation losses.

We construct training pairs for a new regression model $f_N$ over consecutive intervals, as shown in Figure [1](https://arxiv.org/html/2606.18663#S1.F1), Part 1. For each proxy $i$ and each adjacent pair $(t_\ell, t_{\ell+1})$, we form one training example (here the subscript $\ell$ indexes the interval, while $\ell^{(i)}$ with a superscript denotes a loss):

$$\underbrace{\big(t_\ell,\; m^{(i)},\; \ell_\ell^{(i)}\big)}_{\text{input}} \;\longrightarrow\; \underbrace{\ell_{\ell+1}^{(i)}}_{\text{target}}. \tag{1}$$

That is, $f_N$ learns to predict the validation loss at the next step interval, conditioned on (a) the current proxy step $t_\ell$, (b) the current mixture $m^{(i)}$, and (c) the current observed loss $\ell_\ell^{(i)}$. Since each of the $M$ trajectories has $N+1$ logged losses and hence $N$ adjacent pairs, this produces $M \times N$ training pairs in total. Crucially, $f_N$ predicts a *local* transition $(t_\ell, m, \ell_\ell) \rightarrow \ell_{\ell+1}$ rather than a global mixture-to-endpoint map. This locality is what enables dynamic scheduling: at any training state, we can ask $f_N$ which mixture is expected to drive the loss lowest in the next interval. Note that $f_N$ is specific to a given $N$, since the training pairs depend on which intervals are selected.
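The pair construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_training_pairs`, the feature layout, and all numeric values are hypothetical, and the resulting pairs would still need to be passed to a regressor.

```python
from typing import List, Sequence, Tuple

# One regression example: ((step, mixture, current_loss), next_loss).
Example = Tuple[Tuple[float, Tuple[float, ...], float], float]


def build_training_pairs(
    steps: Sequence[float],
    mixtures: Sequence[Sequence[float]],
    trajectories: Sequence[Sequence[float]],
) -> List[Example]:
    """Turn M proxy trajectories logged at N+1 steps into M*N local transitions."""
    pairs: List[Example] = []
    for mixture, losses in zip(mixtures, trajectories):
        if len(losses) != len(steps):
            raise ValueError("each trajectory needs one loss per logged step")
        # Each adjacent pair of logged steps contributes one example.
        for k in range(len(steps) - 1):
            features = (steps[k], tuple(mixture), losses[k])
            pairs.append((features, losses[k + 1]))
    return pairs


# Toy data (hypothetical values): M = 2 proxies, 3 domains, N + 1 = 3 logged steps.
steps = [200, 600, 1000]
mixtures = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
trajectories = [[6.1, 4.8, 4.2], [6.3, 5.0, 4.5]]
pairs = build_training_pairs(steps, mixtures, trajectories)
```

With $M = 2$ proxies and $N = 2$ switch points, the sketch yields $M \times N = 4$ examples, matching the count stated in the text.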

### 3.3 Offline Mixture Schedule (Algorithm [1](https://arxiv.org/html/2606.18663#alg1))

Given the regression model $f_N$ trained for the desired $N$, we can generate mixture schedules with $N$ switch points $\{s_1, s_2, \ldots, s_N\}$, corresponding to the step intervals used to train $f_N$. The schedule divides target training into $N+1$ segments. The first segment uses the human-curated Pile prior $m_0^{\text{human}}$, reflecting the assumption that no prior knowledge about the target corpus is available at the start of training, and each subsequent segment uses a regression-predicted mixture.

Starting from an initial loss $\ell_0$, we iterate over the switch points $s_1, \ldots, s_N$. At each switch point $s_j$, we search for the mixture $m_j^*$ minimizing $f_N(s_j, m, \ell_{j-1})$ over Dirichlet-sampled candidates, record $(s_j, m_j^*)$ as one schedule entry, and update $\ell_j \leftarrow f_N(s_j, m_j^*, \ell_{j-1})$ as the input loss for the next iteration. The novelty lies in how $f_N$ is queried across the $N$ switch points: predictions are made *recursively*, with each prediction's output loss fed into the next prediction as input. The final schedule $\{(s_j, m_j^*)\}_{j=1}^{N}$ is mapped to target training via $s_j^{\text{target}} = s_j \cdot (T_{\text{target}} / T_p)$, where $T_{\text{target}}$ is the number of training steps of the target model.
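The recursive query pattern can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not from the paper: `toy_predict` is a hypothetical stand-in for the trained $f_N$, the candidate list is fixed rather than Dirichlet-sampled, the best mixture is chosen by a plain argmin instead of top-$k$ averaging, and the switch points are made-up values.

```python
from typing import Callable, List, Sequence, Tuple

Mixture = Tuple[float, ...]
Predictor = Callable[[float, Mixture, float], float]


def offline_schedule(
    predict: Predictor,
    switch_points: Sequence[float],
    initial_loss: float,
    candidates: Sequence[Mixture],
    proxy_steps: int,
    target_steps: int,
) -> List[Tuple[int, Mixture]]:
    """Recursively pick one mixture per switch point, feeding predictions forward."""
    schedule: List[Tuple[int, Mixture]] = []
    loss = initial_loss
    scale = target_steps / proxy_steps  # maps proxy steps to target steps
    for s in switch_points:
        # Simplification: plain argmin over a fixed candidate list.
        predicted = [predict(s, m, loss) for m in candidates]
        best_index = min(range(len(candidates)), key=predicted.__getitem__)
        # The predicted loss becomes the input loss of the next query.
        loss = predicted[best_index]
        schedule.append((round(s * scale), candidates[best_index]))
    return schedule


def toy_predict(step: float, mixture: Mixture, loss: float) -> float:
    """Hypothetical stand-in for f_N: early steps favour domain 0, later ones domain 1."""
    preferred = 0 if step < 500 else 1
    return loss * (1.0 - 0.1 * mixture[preferred])


candidates = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8)]
schedule = offline_schedule(
    toy_predict, [200, 400, 600, 800], 6.0, candidates,
    proxy_steps=1000, target_steps=25000,
)
```

Because no target-model loss is ever observed, any error in an early prediction propagates into every later query; the sketch makes that dependency explicit through the single `loss` variable.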

### 3.4 Online Dynamic Mixture (Algorithm [2](https://arxiv.org/html/2606.18663#alg2))

The recursive use of predictions in offline mode may iteratively accumulate error\. The online variant replaces predicted losses with*observed*target\-model losses at each switch point, grounding subsequent mixture decisions during target training\.

Concretely, the same regression model $f_N$ trained on proxy trajectories is loaded as a frozen predictor at the start of target training. At each switch point $S_j = s_j \cdot (T_{\text{target}} / T_p)$, we query $f_N$ using the validation loss $\hat{\ell}_{S_j}$ measured on the target model, then select the next segment's mixture via the same Dirichlet sampling and top-$k$ averaging used in the offline variant.

#### Cross\-scale loss correction\.

Because $f_N$ is trained on proxy-scale losses (proxy model size $P_p$) while target losses come from a much larger model (size $P_t$), the two are not directly comparable. We apply a power-law correction to map target losses to proxy scale before invoking $f_N$ (Kaplan et al., [2020](https://arxiv.org/html/2606.18663#bib.bib10)):

$$\tilde{\ell}_{S_j} \;=\; \hat{\ell}_{S_j} \cdot \bigl(P_t / P_p\bigr)^{\beta}, \tag{2}$$

where $\beta$ is a small positive constant. In our setup $P_t / P_p = 1000$, and we use $\beta = 0.05$, selected from a sweep over $\beta \in \{0.01, 0.02, 0.05, 0.10\}$; final downstream performance varies by less than $0.25$ across this range (Table [3](https://arxiv.org/html/2606.18663#A1.T3)), indicating that online adaptation is robust to the precise choice of $\beta$. The query at switch point $S_j$ thus becomes $f_N(s_j, m, \tilde{\ell}_{S_j})$.
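As a worked instance of the correction, the stated setting $P_t / P_p = 1000$ and $\beta = 0.05$ gives a multiplicative factor of $1000^{0.05} = 10^{0.15} \approx 1.41$. The Python sketch below applies that formula; the function name `to_proxy_scale` and the example loss of 3.0 are hypothetical and chosen only for illustration.

```python
def to_proxy_scale(target_loss: float, size_ratio: float = 1000.0, beta: float = 0.05) -> float:
    """Apply the power-law correction: target_loss * (P_t / P_p) ** beta."""
    if target_loss <= 0 or size_ratio <= 0:
        raise ValueError("loss and size ratio must be positive")
    return target_loss * size_ratio ** beta


factor = to_proxy_scale(1.0)     # correction factor alone, about 1.41
corrected = to_proxy_scale(3.0)  # hypothetical observed target loss of 3.0
```

The corrected value, rather than the raw target loss, is what would be passed as the loss argument when querying the regression model in online mode.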

## 4 Experiments

Table 1: Downstream performance on 13 tasks, averaged over 0-5 shot settings. All methods train a 1B target model on 25B tokens. RegMix-D variants use $N=5$ switch points. Bold and underline indicate the best and second-best result per row. RegMix-D (128) uses 128 proxy models, approximately 25% of RegMix's 512-proxy compute budget.

### 4.1 Experimental Setup

Following Liu et al. ([2025](https://arxiv.org/html/2606.18663#bib.bib16)), we use the 1B TinyLlama (Zhang et al., [2024](https://arxiv.org/html/2606.18663#bib.bib31)) trained on 25B tokens of the Pile (Gao et al., [2020](https://arxiv.org/html/2606.18663#bib.bib8)) dataset (17 domains) as the target model and a 1M-parameter TinyLlama variant as the proxy model in the main experiments. We compare against three baselines: Human, DoReMi (Xie et al., [2023](https://arxiv.org/html/2606.18663#bib.bib29)), and RegMix (Liu et al., [2025](https://arxiv.org/html/2606.18663#bib.bib16)). We report RegMix-D (128) and RegMix-D (512), each in two deployment variants, offline and online, with $N=5$ switch points in the main experiments. We report average results over 13 tasks from lm-eval-harness ([https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); Gao et al., [2024](https://arxiv.org/html/2606.18663#bib.bib9)). More details can be found in Appendix [A.1](https://arxiv.org/html/2606.18663#A1.SS1).

### 4.2 Main Results

Table [1](https://arxiv.org/html/2606.18663#S4.T1) reports downstream performance, where RegMix-D consistently outperforms the baselines at both proxy budgets. Notably, even with only 128 proxies (25% of the proxy compute used by RegMix), both RegMix-D variants still surpass RegMix in average score. Figure [2](https://arxiv.org/html/2606.18663#A1.F2) shows the pile-cc validation loss across training: all four RegMix-D variants achieve consistently lower loss than RegMix. The corresponding dynamic mixtures differ substantially from RegMix's static mixture, and we visualize the pile-cc weight trajectory in Figure [3](https://arxiv.org/html/2606.18663#A1.F3), Appendix [A.4](https://arxiv.org/html/2606.18663#A1.SS4).

Online generally outperforms offline. Across both proxy budgets, the online variant achieves higher average accuracy than the offline variant. We attribute this to the gap between proxy and target dynamics: offline schedules are generated entirely from proxy loss trajectories, while online predictions condition on the actual target model's observed loss at each switch point. This grounding in real target dynamics provides a more accurate signal for downstream performance.

Improvements are broad rather than concentrated. RegMix-D variants achieve top-1 or top-2 performance on 11 of 13 tasks. We do not observe a single task category (e.g., commonsense, reading comprehension) where RegMix-D systematically wins or loses, suggesting the gains stem from a generally better mixture rather than implicit specialization toward particular tasks.

Compute scaling helps modestly. Increasing the number of proxies from 128 to 512 yields a smaller average gain than switching from static to dynamic mixing. This suggests that the dynamic-vs-static gap dominates the proxy-count gap in our setup: RegMix-D extracts more signal from each proxy run than RegMix does, by exploiting the full loss trajectory rather than a single endpoint. We further analyze proxy count in the next section.

### 4.3 Ablation Study

Table [3](https://arxiv.org/html/2606.18663#A1.T3) in the Appendix reports ablations on the three main design choices of RegMix-D. All 26 configurations across the three studies outperform the RegMix baseline (47.41), indicating that the benefit of dynamic mixing is robust to specific hyperparameter choices rather than reliant on careful tuning. Three more specific findings follow. First, the number of switch points $N$ has a moderate but consistent effect: $N=5$ is best in all four proxy/variant settings, while $N=3, 7, 9$ also reliably surpass RegMix. Second, the cross-scale loss correction factor $\beta$ has limited influence on the Online variant, with a maximum Avg variation of $0.25$. This robustness justifies treating $\beta$ as a fixed constant rather than a tuned hyperparameter. Third, increasing the proxy model size from 1M to 120M yields virtually identical performance (within $0.05$ Avg for both variants), a finding consistent with Liu et al. ([2025](https://arxiv.org/html/2606.18663#bib.bib16)). We discuss the overhead of RegMix-D in Appendix [A.2](https://arxiv.org/html/2606.18663#A1.SS2).

## 5 Conclusion

We present RegMix-D, a simple regression-based extension of RegMix that generates dynamic data mixture ratios, with both offline and online variants and no additional optimization machinery. On the Pile dataset with a 1B-parameter target model, RegMix-D improves over RegMix and DoReMi across 13 downstream tasks, and surpasses RegMix even with 25% of its proxy compute. Ablation studies show the gains are robust to switch-point count, scaling factor, and proxy size, indicating that effective dynamic mixing is achievable at low cost.

## Limitations

RegMix-D has several limitations that motivate future work. First, our experiments use a single target-domain validation loss (pile-cc) as the optimization signal, following RegMix's setup. Extending RegMix-D to multi-target or domain-agnostic objectives requires a non-trivial extension of the regression framework and is left to future work. Second, while several recent dynamic data mixing methods (e.g., Aioli, TiKMiX) report strong results in their respective settings, fair head-to-head comparison is challenging: existing works use different corpora (e.g., SlimPajama vs. Pile), domain partitions, target model scales, and downstream benchmark suites, making direct numerical comparison costly to reproduce within a single setup. We therefore restrict our quantitative comparisons to RegMix and DoReMi, which share RegMix-D's deployment setup.

## Ethical Statement

This work studies data mixture selection for language model pretraining. We use the Pile (Gao et al., [2020](https://arxiv.org/html/2606.18663#bib.bib8)), a publicly released corpus, and follow its existing domain partitioning. RegMix-D does not introduce new training data, generation capabilities, or downstream applications beyond those already present in standard pretraining pipelines, and we therefore do not foresee direct ethical risks specific to our method. We use LLMs only for language and grammar polishing.

## References

- Belenki et al. (2025) Lior Belenki, Alekh Agarwal, Tianze Shi, and Kristina Toutanova. 2025. [Optimizing pre-training data mixtures with mixtures of data expert models](https://doi.org/10.18653/v1/2025.acl-long.1564). In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 32570–32587, Vienna, Austria. Association for Computational Linguistics.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In *Thirty-Fourth AAAI Conference on Artificial Intelligence*.
- Chen et al. (2025) Mayee F Chen, Michael Y. Hu, Nicholas Lourie, Kyunghyun Cho, and Christopher Re. 2025. [Aioli: A unified optimization framework for language model data mixing](https://openreview.net/forum?id=sZGZJhaNSe). In *The Thirteenth International Conference on Learning Representations*.
- Chen et al. (2026) Zhuo Chen, Yuxuan Miao, and Deyi Xiong. 2026. [Data mixing for large language models pretraining: A survey and outlook](https://doi.org/10.3724/2096-7004.di.2026.0055). *Data Intelligence*, 8(1):15–55.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. *ArXiv*, abs/1803.05457.
- Fan et al. (2024) Simin Fan, Matteo Pagliardini, and Martin Jaggi. 2024. [DOGE: Domain reweighting with generalization estimation](https://proceedings.mlr.press/v235/fan24e.html). In *Proceedings of the 41st International Conference on Machine Learning*, volume 235 of *Proceedings of Machine Learning Research*, pages 12895–12915. PMLR.
- Feng et al. (2024) Steven Feng, Shrimai Prabhumoye, Kezhi Kong, Dan Su, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro. 2024. [Maximize your data's potential: Enhancing LLM accuracy with two-phase pretraining](https://arxiv.org/abs/2412.15285). *Preprint*, arXiv:2412.15285.
- Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. [The Pile: An 800GB dataset of diverse text for language modeling](https://arxiv.org/abs/2101.00027). *Preprint*, arXiv:2101.00027.
- Gao et al. (2024) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. [The language model evaluation harness](https://doi.org/10.5281/zenodo.12608602).
- Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. [Scaling laws for neural language models](https://arxiv.org/abs/2001.08361). *Preprint*, arXiv:2001.08361.
- Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. In *Proceedings of the 31st International Conference on Neural Information Processing Systems*, NIPS'17, pages 3149–3157, Red Hook, NY, USA. Curran Associates Inc.
- Khashabi et al. (2018) Daniel Khashabi, Snigdha Chaturvedi, Michael Roth, Shyam Upadhyay, and Dan Roth. 2018. [Looking beyond the surface: A challenge set for reading comprehension over multiple sentences](https://doi.org/10.18653/v1/N18-1023). In *Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)*, pages 252–262, New Orleans, Louisiana. Association for Computational Linguistics.
- Lai et al. (2017) Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. [RACE: Large-scale ReAding comprehension dataset from examinations](https://doi.org/10.18653/v1/D17-1082). In *Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing*, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.
- Li et al. (2026) Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, and Shaosheng Cao. 2026. Decouple searching from training: Scaling data mixing via model merging for large language model pre-training. *arXiv preprint arXiv:2602.00747*.
- Liu et al. (2020) Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. 2020. [LogiQA: A challenge dataset for machine reading comprehension with logical reasoning](https://arxiv.org/abs/2007.08124). *Preprint*, arXiv:2007.08124.
- Liu et al. (2025) Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, and Min Lin. 2025. [RegMix: Data mixture as regression for language model pre-training](https://openreview.net/forum?id=5BjQOUXq7i). In *The Thirteenth International Conference on Learning Representations*.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. [Decoupled weight decay regularization](https://openreview.net/forum?id=Bkg6RiCqY7). In *International Conference on Learning Representations*.
- Ma et al. (2025) Jing Ma, Chenhao Dang, and Mingjie Liao. 2025. [Actor-critic based online data mixing for language model pre-training](https://arxiv.org/abs/2505.23878). *Preprint*, arXiv:2505.23878.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? A new dataset for open book question answering. In *EMNLP*.
- Mo et al. (2025) Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxiang Zeng. 2025. [Mid-training of large language models: A survey](https://arxiv.org/abs/2510.06826). *Preprint*, arXiv:2510.06826.
- Oren et al. (2019) Yonatan Oren, Shiori Sagawa, Tatsunori B. Hashimoto, and Percy Liang. 2019. [Distributionally robust language modeling](https://doi.org/10.18653/v1/D19-1432). In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pages 4227–4237, Hong Kong, China. Association for Computational Linguistics.
- Paperno et al. (2016) Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. [The LAMBADA dataset: Word prediction requiring a broad discourse context](https://doi.org/10.18653/v1/P16-1144). In *Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1525–1534, Berlin, Germany. Association for Computational Linguistics.
- Sakaguchi et al. (2019) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. WinoGrande: An adversarial winograd schema challenge at scale. *arXiv preprint arXiv:1907.10641*.
- Sarlin et al. (2020) Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. [SuperGlue: Learning feature matching with graph neural networks](https://arxiv.org/abs/1911.11763). In *CVPR*.
- Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. [GLUE: A multi-task benchmark and analysis platform for natural language understanding](https://doi.org/10.18653/v1/W18-5446). In *Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP*, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
- Wang et al. (2026) Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, and Jun Zhou. 2026. [Mergemix: Optimizing mid-training data mixtures via learnable model merging](https://arxiv.org/abs/2601.17858). *Preprint*, arXiv:2601.17858.
- Wang et al. (2025) Yifan Wang, Binbin Liu, Fengze Liu, Yuanfan Guo, Jiyao Deng, Xuecheng Wu, Weidong Zhou, Xiaohuan Zhou, and Taifeng Wang. 2025. [TiKMiX: Take data influence into dynamic mixture for language model pre-training](https://arxiv.org/abs/2508.17677). *Preprint*, arXiv:2508.17677.
- Welbl et al. (2017) Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions. In *NUT@EMNLP*.
- Xie et al. (2023) Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V Le, Tengyu Ma, and Adams Wei Yu. 2023. [DoReMi: Optimizing data mixtures speeds up language model pretraining](https://openreview.net/forum?id=lXuByUeHhd). In *Thirty-seventh Conference on Neural Information Processing Systems*.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. [HellaSwag: Can a machine really finish your sentence?](https://doi.org/10.18653/v1/P19-1472) In *Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
- Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. [TinyLlama: An open-source small language model](https://arxiv.org/abs/2401.02385). *Preprint*, arXiv:2401.02385.

## Appendix A

### A.1 Implementation Details

We use the following hyperparameters: the AdamW optimizer (Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.18663#bib.bib17)) with weight decay 0.1, learning rate 4e-4, context length 2048, and global batch size 512 (achieved via gradient accumulation). Proxy models are trained on a single H800 GPU for 1,000 steps (1M tokens per step), and the target model is trained on 8 GPUs for 25,000 steps, totaling 25B tokens. Human denotes the original Pile token distribution. The adjusted 17 domains are from Liu et al. ([2025](https://arxiv.org/html/2606.18663#bib.bib16)).

For all baselines (Human, DoReMi, and RegMix), we use the released open-source mixtures and train target models under our exact hyperparameter setup. This isolates the contribution of the mixture itself from any unrelated training-recipe differences. All target model runs use a single seed. We use LightGBM (Ke et al., [2017](https://arxiv.org/html/2606.18663#bib.bib11)) as the regression model, matching the choice in RegMix.

The regression target is the Pile\-CC validation loss\. To select a mixture, we sample 100K candidate mixtures from a Dirichlet distribution and average the 128 candidates with the lowest predicted validation loss\.
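The selection step can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: `select_mixture`, `toy_predict_loss`, the four-domain prior, and the `ideal` mixture are all hypothetical, and the toy loss stands in for the fitted LightGBM regressor. The Dirichlet parameterization (concentration times prior) follows Algorithm 1.

```python
import numpy as np

def select_mixture(predict_loss, prior, n_candidates=100_000, top_k=128,
                   concentration=1.0, seed=0):
    """Average the top_k sampled mixtures with the lowest predicted loss."""
    rng = np.random.default_rng(seed)
    # Each row is one candidate mixture on the probability simplex.
    alpha = concentration * np.asarray(prior, dtype=float)
    candidates = rng.dirichlet(alpha, size=n_candidates)
    predicted = np.asarray(predict_loss(candidates))
    # argpartition puts the top_k smallest losses in the first top_k slots.
    best = np.argpartition(predicted, top_k - 1)[:top_k]
    return candidates[best].mean(axis=0)

# Hypothetical stand-in for the fitted regressor: the loss grows with the
# squared distance from an arbitrary "ideal" mixture.
ideal = np.array([0.4, 0.3, 0.2, 0.1])

def toy_predict_loss(mixtures):
    return 2.0 + ((mixtures - ideal) ** 2).sum(axis=1)

prior = np.full(4, 0.25)  # hypothetical uniform prior over four domains
mixture = select_mixture(toy_predict_loss, prior, concentration=4.0)
print(mixture.round(3), mixture.sum())
```

Averaging the top candidates rather than taking the single best one smooths out noise in the regressor's predictions, and the average of points on the simplex is itself a valid mixture.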

The full list of evaluated tasks is: HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2606.18663#bib.bib30)\), PIQA \(Bisk et al\., [2020](https://arxiv.org/html/2606.18663#bib.bib2)\), OpenBookQA \(Mihaylov et al\., [2018](https://arxiv.org/html/2606.18663#bib.bib19)\), LAMBADA \(Paperno et al\., [2016](https://arxiv.org/html/2606.18663#bib.bib22)\), SciQ \(Welbl et al\., [2017](https://arxiv.org/html/2606.18663#bib.bib28)\), ARC\-Easy \(Clark et al\., [2018](https://arxiv.org/html/2606.18663#bib.bib5)\), ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2606.18663#bib.bib5)\), COPA \(Sarlin et al\., [2020](https://arxiv.org/html/2606.18663#bib.bib24)\), RACE \(Lai et al\., [2017](https://arxiv.org/html/2606.18663#bib.bib13)\), LogiQA \(Liu et al\., [2020](https://arxiv.org/html/2606.18663#bib.bib15)\), QQP \(Wang et al\., [2018](https://arxiv.org/html/2606.18663#bib.bib25)\), WinoGrande \(Sakaguchi et al\., [2019](https://arxiv.org/html/2606.18663#bib.bib23)\), and MultiRC \(Khashabi et al\., [2018](https://arxiv.org/html/2606.18663#bib.bib12)\)\.

**Algorithm 1** Offline Mixture Schedule Generation

1: **Input:** regression model $f_N$, switch points $\{s_1, \ldots, s_N\}$, initial loss $\ell_0 = \frac{1}{M}\sum_{i=1}^{M} \ell_1^{(i)}$, candidate count $C$, top\-$k$, human prior $m_0^{\text{human}}$, concentration $\alpha$

2: **Output:** schedule $\mathcal{S} = \{(s_j, m_j^*)\}_{j=1}^{N}$

3: $\mathcal{S} \leftarrow \emptyset$

4: **for** $j = 1, \ldots, N$ **do**

5: Sample $\{m^{(c)}\}_{c=1}^{C} \sim \mathrm{Dir}(\alpha \cdot m_0^{\text{human}})$

6: $\hat{\ell}^{(c)} \leftarrow f_N(s_j, m^{(c)}, \ell_{j-1})$ for $c = 1, \ldots, C$

7: $\mathcal{I} \leftarrow \mathrm{TopK}_c\bigl(-\hat{\ell}^{(c)}\bigr)$ ▷ indices of $k$ lowest losses

8: $m_j^* \leftarrow \frac{1}{k}\sum_{c \in \mathcal{I}} m^{(c)}$

9: $\ell_j \leftarrow f_N(s_j, m_j^*, \ell_{j-1})$ ▷ input loss for next iteration

10: $\mathcal{S} \leftarrow \mathcal{S} \cup \{(s_j, m_j^*)\}$

11: **end for**

12: **return** $\mathcal{S}$
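As a rough illustration, the loop of Algorithm 1 can be written as follows. This is a sketch under stated assumptions rather than the released implementation: `offline_schedule` and `toy_f` are hypothetical names, the toy regressor replaces the fitted LightGBM model, and the switch points, prior, and candidate counts in the demo are made up for the example.

```python
import numpy as np

def offline_schedule(f, switch_points, initial_loss, prior,
                     n_candidates=2_000, top_k=32, concentration=1.0, seed=0):
    """Return [(s_j, m_j*)] following the loop of Algorithm 1."""
    rng = np.random.default_rng(seed)
    alpha = concentration * np.asarray(prior, dtype=float)
    schedule, prev_loss = [], float(initial_loss)
    for s in switch_points:
        cands = rng.dirichlet(alpha, size=n_candidates)          # line 5
        pred = np.asarray(f(s, cands, prev_loss))                 # line 6
        best = np.argpartition(pred, top_k - 1)[:top_k]           # line 7
        m_star = cands[best].mean(axis=0)                         # line 8
        # Feed the predicted loss of the chosen mixture into the next stage.
        prev_loss = float(f(s, m_star[None, :], prev_loss)[0])    # line 9
        schedule.append((s, m_star))                              # line 10
    return schedule

# Hypothetical regressor: prefers a growing weight on domain 0 over time.
def toy_f(step, mixtures, prev_loss):
    preferred = 0.5 + 0.3 * step / 1000
    return 0.9 * prev_loss + (mixtures[:, 0] - preferred) ** 2

prior = np.full(3, 1 / 3)                  # hypothetical three-domain prior
switch_points = [200, 400, 600, 800, 1000]  # hypothetical proxy steps, N = 5
schedule = offline_schedule(toy_f, switch_points, initial_loss=10.0,
                            prior=prior, concentration=3.0)
for step, m in schedule:
    print(step, m.round(3))
```

Because each stage conditions on a predicted rather than an observed loss, prediction errors can accumulate along the schedule; the online variant replaces that input with a measured validation loss.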

**Algorithm 2** Online Dynamic Mixture

1: **Input:** regression model $f_N$, target model $\mathcal{M}$, target\-step switch points $\{S_1, \ldots, S_N\}$, candidate count $C$, top\-$k$, scaling exponent $\beta$, size ratio $P_t/P_p$, human prior $m_0^{\text{human}}$, concentration $\alpha$

2: Initialize $\mathcal{M}$ with mixture $m_0^{\text{human}}$

3: **for** each training step $t$ **do**

4: Take one training step of $\mathcal{M}$

5: **if** $t \bmod T_{\text{eval}} = 0$ **then**

6: $\hat{\ell}_t \leftarrow \mathrm{validate}(\mathcal{M})$ ▷ cached for online use

7: **end if**

8: **if** $t \in \{S_1, \ldots, S_N\}$, say $t = S_j$ **then**

9: $\tilde{\ell}_{S_j} \leftarrow \hat{\ell}_{S_j} \cdot (P_t/P_p)^{\beta}$ ▷ scale to proxy

10: $s_j \leftarrow t \cdot T_p / T_{\text{target}}$ ▷ proxy\-step image

11: Sample $\{m^{(c)}\}_{c=1}^{C} \sim \mathrm{Dir}(\alpha \cdot m_0^{\text{human}})$

12: $\hat{\ell}^{(c)} \leftarrow f_N(s_j, m^{(c)}, \tilde{\ell}_{S_j})$ for $c = 1, \ldots, C$

13: $\mathcal{I} \leftarrow \mathrm{TopK}_c\bigl(-\hat{\ell}^{(c)}\bigr)$

14: $m_j^* \leftarrow \frac{1}{k}\sum_{c \in \mathcal{I}} m^{(c)}$

15: Update $\mathcal{M}$’s training mixture to $m_j^*$

16: **end if**

17: **end for**
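The work done at a single switch point in Algorithm 2 (lines 9 to 14) might look like the sketch below. The names `online_update` and `toy_f` are hypothetical, and the toy regressor again stands in for the fitted model. The demo values are a reading of the setup elsewhere in the paper, not something Algorithm 2 specifies: a size ratio of 1000 (1B target over 1M proxy), $\beta = 0.05$, $T_p = 1000$, and $T_{\text{target}} = 25000$. The observed loss of 3.0 is invented.

```python
import numpy as np

def online_update(f, target_step, observed_loss, prior, *, size_ratio, beta,
                  proxy_steps, target_steps, n_candidates=2_000, top_k=32,
                  concentration=1.0, seed=0):
    """One switch-point update: (proxy_step, scaled_loss, new_mixture)."""
    rng = np.random.default_rng(seed)
    scaled_loss = observed_loss * size_ratio ** beta             # line 9
    proxy_step = target_step * proxy_steps / target_steps        # line 10
    alpha = concentration * np.asarray(prior, dtype=float)
    cands = rng.dirichlet(alpha, size=n_candidates)              # line 11
    pred = np.asarray(f(proxy_step, cands, scaled_loss))         # line 12
    best = np.argpartition(pred, top_k - 1)[:top_k]              # line 13
    return proxy_step, scaled_loss, cands[best].mean(axis=0)     # line 14

# Hypothetical regressor: prefers a growing weight on domain 0 over time.
def toy_f(step, mixtures, prev_loss):
    preferred = 0.5 + 0.3 * step / 1000
    return 0.9 * prev_loss + (mixtures[:, 0] - preferred) ** 2

prior = np.full(3, 1 / 3)  # hypothetical three-domain prior
proxy_step, scaled_loss, new_mixture = online_update(
    toy_f, target_step=5_000, observed_loss=3.0, prior=prior,
    size_ratio=1000.0, beta=0.05, proxy_steps=1_000, target_steps=25_000,
    concentration=3.0)
print(proxy_step, round(scaled_loss, 4), new_mixture.round(3))
```

In a training loop this function would be called only when the current step equals one of the switch points, using the most recent cached validation loss.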

![Refer to caption](https://arxiv.org/html/2606.18663v1/x2.png)

Figure 2: Pile\-CC validation loss across training for RegMix and all four RegMix\-D variants \(Offline/Online $\times$ 128/512 proxies\), at $N = 5$ switch points\. All four RegMix\-D variants achieve consistently lower validation loss than RegMix throughout training\. The inset zooms into the final 5B tokens, where RegMix\-D Online \(512\) attains the lowest loss\.

### A\.2 Overhead for RegMix\-D

RegMix\-D’s online adaptation adds minimal cost to target training\. Table [2](https://arxiv.org/html/2606.18663#A1.T2) reports wall\-clock training time on 8 GPUs \(averaged over 3 runs\): RegMix\-D Offline matches RegMix exactly \(the schedule is generated before training\), while RegMix\-D Online incurs only $+0.37\%$ overhead from per\-switch\-point regression queries\. The additional flexibility of online adaptation therefore comes at negligible computational cost\.

Table 2: Target\-model training wall\-clock time on 8 GPUs \(25,000 steps, averaged over 3 runs\)\.

For proxy training, one 1M proxy model takes about 0\.5 hours on a single H800, so reducing the number of proxies from 512 to 128 cuts the total proxy training cost from approximately 256 to 64 H800 GPU\-hours\.
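The proxy budget figures are a direct product of proxy count and per-proxy time. A minimal check, assuming the quoted 0.5 hours holds for every proxy run (the helper name `proxy_cost_gpu_hours` is hypothetical):

```python
HOURS_PER_PROXY = 0.5  # approximate H800 hours per proxy run, from the text

def proxy_cost_gpu_hours(n_proxies, hours_per_proxy=HOURS_PER_PROXY):
    """Total single-GPU hours needed to train n_proxies proxy models."""
    return n_proxies * hours_per_proxy

full = proxy_cost_gpu_hours(512)
reduced = proxy_cost_gpu_hours(128)
print(full, reduced, reduced / full)
```

The ratio of the two budgets is 0.25, matching the "25% of RegMix's proxy compute budget" stated in the abstract.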

### A\.3 From Proxy Timescale to Target Training

RegMix\-D’s regression model is trained entirely on proxy\-scale trajectories \($T_p$ steps\), while target training is much longer \($T_{\text{target}} \gg T_p$\)\. We assume that proxy and target training share enough loss\-curve structure to make the proxy\-predicted switch points transferable\. Concretely, each proxy switch point $s_j$ is linearly mapped to the target step $s_j \cdot (T_{\text{target}}/T_p)$\. This assumption also underlies RegMix’s original design: the static mixture selected from proxy losses is similarly assumed to transfer to target training\. RegMix\-D inherits this assumption without strengthening it\.
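The linear mapping is simple enough to state as code. In the sketch below, the function names and the evenly spaced proxy switch points are hypothetical; $T_p = 1000$ and $T_{\text{target}} = 25000$ are the step counts from the implementation details.

```python
T_PROXY = 1_000    # proxy training steps
T_TARGET = 25_000  # target training steps

def to_target_step(proxy_step, t_proxy=T_PROXY, t_target=T_TARGET):
    """Map a proxy switch point s_j to its target step s_j * T_target / T_p."""
    return proxy_step * t_target // t_proxy

def to_proxy_step(target_step, t_proxy=T_PROXY, t_target=T_TARGET):
    """Inverse map, as used by the online variant at each switch point."""
    return target_step * t_proxy / t_target

proxy_switch_points = [200, 400, 600, 800, 1000]  # hypothetical, N = 5
target_switch_points = [to_target_step(s) for s in proxy_switch_points]
print(target_switch_points)
```

With these step counts every proxy step corresponds to 25 target steps, and mapping a target switch point back with `to_proxy_step` recovers the original proxy step.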

Table 3: Ablations on RegMix\-D \(downstream Avg, %\)\. All configurations exceed the RegMix baseline \(47\.41\)\. \(a\) Number of switch points $N$ at the 128\- / 512\-proxy setting\. \(b\) Cross\-scale loss correction factor $\beta$ for the Online variant; maximum variation is $0.25$\. We use $\beta = 0.05$ as the default in the main experiments\. \(c\) Proxy model size analysis at the 128\-proxy setting\. Bold: best per row\.

### A\.4 Visualization of the Dynamic Mixture

![Refer to caption](https://arxiv.org/html/2606.18663v1/x3.png)

Figure 3: Pile\-CC weight across training for RegMix \(static, dashed\) and RegMix\-D \(dynamic, $N = 5$, 512 proxies\)\. The RegMix\-D variants assign substantially different mixture weights than RegMix\. This illustrates the dynamic schedules discovered by our method\.

Figure [3](https://arxiv.org/html/2606.18663#A1.F3) visualizes the Pile\-CC weights produced by RegMix\-D \(Offline and Online\) at the $N = 5$, 512\-proxy setting, compared with RegMix’s static weight of 0\.870\. The dynamic schedules differ markedly from the static baseline: both variants assign substantially different Pile\-CC weights in 4 of 5 segments\. The Offline variant exhibits a more pronounced dip in segment 2 \(0\.555\), reflecting a stronger early redistribution of weight toward non\-dominant domains, before returning to high Pile\-CC weights in later segments\. The Online variant follows a similar overall trajectory but with smaller fluctuations, likely because it conditions on observed target\-model losses rather than recursively predicted ones\.

We restrict the visualization to Pile\-CC because the remaining 16 Pile domains each contribute less than 5% of the mixture, and their individual weight changes are difficult to distinguish on a shared axis\.
