FastMix: Fast Data Mixture Optimization via Gradient Descent
Summary
FastMix is a novel framework that automates data mixture discovery for training large models using a single proxy model and bilevel optimization, achieving state-of-the-art performance with significant efficiency gains.
# FastMix: Fast Data Mixture Optimization via Gradient Descent
Source: [https://arxiv.org/html/2606.14971](https://arxiv.org/html/2606.14971)
Haoru Tan (1,2), Sitong Wu (3), Yanfeng Chen (2,†), Jun Xia (2), Ruobing Xie (2), Bin Xia (3), Xingwu Sun (2), Xiaojuan Qi (1,†)

(1) University of Hong Kong; (2) Hunyuan LLM, Tencent; (3) Chinese University of Hong Kong
###### Abstract
While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FastMix, a novel framework that automates data mixture discovery while training only a *single proxy model*. Instead of relying on predefined heuristics or resource-intensive simulations, FastMix jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FastMix is a reformulation of mixture selection as a *bilevel optimization* problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FastMix implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FastMix outperforms baselines while drastically reducing search cost. Code: https://github.com/hrtan/fastmix
Figure 1: Average performance versus time cost (GPU-hours) for various data mixture strategies. (a) Pre-training: our proposed FastMix achieves the highest performance with the lowest time cost. The annotations highlight that it is up to 55× more time-efficient than CLIMB (Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)) and 550× more time-efficient than RegMix (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)), while providing a significant performance gain. (b) Post-training: FastMix again demonstrates state-of-the-art performance and time efficiency, outperforming RegMix with a 52× reduction in time cost and gaining an additional 5.5 performance points over CLIMB. This illustrates the superior trade-off between performance and time cost achieved by our method.

## 1 Introduction
The performance of large-scale models (Yang et al., [2024b](https://arxiv.org/html/2606.14971#bib.bib13); Dubey et al., [2024](https://arxiv.org/html/2606.14971#bib.bib8); Touvron et al., [2023](https://arxiv.org/html/2606.14971#bib.bib1); Hu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib15)) depends critically on the data used for training. While large and diverse datasets have driven recent advances, identifying the optimal data mixture for pre-training (Shukor et al., [2025](https://arxiv.org/html/2606.14971#bib.bib69)) and post-training (Dong et al., [2023](https://arxiv.org/html/2606.14971#bib.bib72)) remains a significant challenge.
Popular methods such as manual trial-and-error (Yang et al., [2023](https://arxiv.org/html/2606.14971#bib.bib14); Tong et al., [2024](https://arxiv.org/html/2606.14971#bib.bib47)) or proxy-based methods (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10); Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)) often do not scale well as models grow larger. For example, proxy-based search methods such as RegMix (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)) and CLIMB (Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)) have demonstrated strong generalization and stability, yet they require training a large number of proxy models during the search. This results in prohibitive computational overhead, making mixture optimization increasingly impractical as both models and datasets continue to expand. The central question is thus: how can we efficiently determine effective data mixtures for large-scale training?
We address this challenge with FastMix, a novel framework that automates data mixture discovery while training only a *single proxy model*. Instead of relying on predefined heuristics or resource-intensive simulations, FastMix jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FastMix is a reformulation of mixture selection as a weighted *bilevel optimization* problem in Eq. ([2](https://arxiv.org/html/2606.14971#S3.E2)). Specifically, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This reparameterization embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem (Maclaurin et al., [2015](https://arxiv.org/html/2606.14971#bib.bib74); Franceschi et al., [2018](https://arxiv.org/html/2606.14971#bib.bib76)), FastMix implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop) via a gradient-based optimizer (Kingma and Ba, [2014](https://arxiv.org/html/2606.14971#bib.bib51)).
Extensive evaluations demonstrate that FastMix optimizes data mixtures across model scales and tasks in both pre-training and post-training, outperforming baselines at a fraction of the computational cost (see Fig. [1](https://arxiv.org/html/2606.14971#S0.F1)). In pre-training, it delivers the top average score of 48.2 and rank 1 across 14 benchmarks (best on 9) with just 1.3 GPU-hours of search, roughly 550× faster than RegMix (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)) and 55× faster than CLIMB (Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)). In post-training (SFT), a math-tuned mixture generalizes to coding and STEM-QA, reaching 65.4 (+5.5 over the next best) in 2.2 GPU-hours versus more than 115 GPU-hours for CLIMB and RegMix. Overall, FastMix makes mixture optimization practical and scalable for next-generation large models.
## 2 Related Work
The rapid progress of large models (Dubey et al., [2024](https://arxiv.org/html/2606.14971#bib.bib8); Touvron et al., [2023](https://arxiv.org/html/2606.14971#bib.bib1); Allal et al., [2024](https://arxiv.org/html/2606.14971#bib.bib29); Yang et al., [2023](https://arxiv.org/html/2606.14971#bib.bib14); [2024a](https://arxiv.org/html/2606.14971#bib.bib16)) relies heavily on strategically mixing data from diverse sources, spanning languages (Yang et al., [2023](https://arxiv.org/html/2606.14971#bib.bib14)), modalities (Gunasekar et al., [2023](https://arxiv.org/html/2606.14971#bib.bib19); Yang et al., [2024b](https://arxiv.org/html/2606.14971#bib.bib13)), and difficulty levels (He et al., [2025](https://arxiv.org/html/2606.14971#bib.bib48)). This *data mixture problem* (Ge et al., [2024](https://arxiv.org/html/2606.14971#bib.bib12)) presents fundamental challenges not only in pre-training (Shukor et al., [2025](https://arxiv.org/html/2606.14971#bib.bib69); Dubey et al., [2024](https://arxiv.org/html/2606.14971#bib.bib8); Yang et al., [2024b](https://arxiv.org/html/2606.14971#bib.bib13)) but also in post-training (Dong et al., [2023](https://arxiv.org/html/2606.14971#bib.bib72); Ming et al., [2025](https://arxiv.org/html/2606.14971#bib.bib71); Tong et al., [2024](https://arxiv.org/html/2606.14971#bib.bib47)). Early practice largely relied on manual heuristics, which lack standardization and often fail to generalize across settings. More recently, optimization-based approaches (Xie et al., [2024](https://arxiv.org/html/2606.14971#bib.bib6); Fan et al., [2023](https://arxiv.org/html/2606.14971#bib.bib7); Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)) have been introduced to automate mixture selection.
Proxy-based methods (Xie et al., [2024](https://arxiv.org/html/2606.14971#bib.bib6); Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10); Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)) adopt a two-phase design in which a proxy model is trained under candidate mixtures and its performance is used to infer optimal sampling ratios. For example, DoReMi (Xie et al., [2024](https://arxiv.org/html/2606.14971#bib.bib6)) trains a small proxy to adjust domain weights based on relative losses, then reuses the optimized ratios to train a larger model. RegMix (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)) scales this idea by training hundreds of proxy models under different ratios, fitting a regression model on the resulting mixture-performance pairs, and extrapolating the optimal mixture. CLIMB (Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)) improves efficiency by iteratively refining the search region, reducing the number of proxy models required. Other works (Ye et al., [2024](https://arxiv.org/html/2606.14971#bib.bib70); Shukor et al., [2025](https://arxiv.org/html/2606.14971#bib.bib69); Kang et al., [2024](https://arxiv.org/html/2606.14971#bib.bib68)) study cross-scale transfer: Shukor et al. ([2025](https://arxiv.org/html/2606.14971#bib.bib69)) provide theoretical and empirical evidence that mixtures found on small models generalize to larger ones, while Ye et al. ([2024](https://arxiv.org/html/2606.14971#bib.bib70)) and Kang et al. ([2024](https://arxiv.org/html/2606.14971#bib.bib68)) report functional relationships between mixture proportions and performance.
In contrast, dynamic methods (Chen et al., [2024](https://arxiv.org/html/2606.14971#bib.bib67); Ming et al., [2025](https://arxiv.org/html/2606.14971#bib.bib71); Albalak et al., [2023](https://arxiv.org/html/2606.14971#bib.bib66)) remove the separate search phase by adjusting mixtures on the fly. IDEAL (Ming et al., [2025](https://arxiv.org/html/2606.14971#bib.bib71)), for instance, leverages influence functions (Koh and Liang, [2017](https://arxiv.org/html/2606.14971#bib.bib49)) to estimate domain contributions to downstream performance and to dynamically rebalance training data.
Overall, proxy-based methods such as RegMix and CLIMB generally achieve stronger and more stable performance than dynamic approaches, but at substantial computational cost. Our method, FastMix, preserves the reliability of proxy-based optimization while cutting search time from hundreds of GPU-hours to nearly one, achieving both higher efficiency and stronger generalization.
## 3 FastMix
### 3.1 Problem reformulation with reparameterization
#### Data Mixture as a Bi-level Optimization Problem.
Formally, data mixture optimization can be posed as a bilevel optimization problem. Let $D=\{D_1,\dots,D_k\}$ be a collection of data sources (or clusters), and let $\alpha\in A\subset\mathbb{R}^{k}$ denote the mixture weights, where the feasible set $A$ is the probability simplex ($\alpha_i\geq 0$ and $\sum_{i=1}^{k}\alpha_i=1$). Given mixture $\alpha$ and model parameters $w$, the training objective is $\mathcal{L}_{\text{train}}(D,w\mid\alpha)$. Let $w^{*}(\alpha)$ be the parameters obtained by (approximately) optimizing this training objective under $\alpha$. The target is to find mixture weights $\alpha^{*}$ that minimize the validation loss, i.e., $\mathcal{L}_{\text{target}}(w)=\ell_{\text{val}}(V,w)$ evaluated at $w^{*}(\alpha)$:

$$\min_{\alpha}\ \mathcal{L}_{\text{target}}\big(w^{*}(\alpha)\big)\quad\text{s.t.}\quad w^{*}(\alpha)=\arg\min_{w}\mathcal{L}_{\text{train}}\big(D,w\mid\alpha\big),\quad\sum_{i=1}^{k}\alpha_i=1,\quad\alpha_i\geq 0, \tag{1}$$

where the inner loop aims to find the optimal model weights $w^{*}(\alpha)$ by minimizing the training loss on the dataset given mixture weights $\alpha$. The outer loop then seeks to optimize these mixture weights $\alpha$ to minimize the model's final loss on target tasks.

While the bi-level formulation is conceptually appealing, it is difficult to solve in practice. The crux is handling the mixture weights $\alpha$. Unlike model parameters $w$, which admit efficient gradient-based updates, mixture (sampling) ratios are typically non-differentiable, precluding end-to-end backpropagation. Consequently, practitioners resort to greedy heuristics or policy-gradient (score-function) updates to adjust $\alpha$. These procedures are sample-inefficient and scale poorly with the number of data sources, turning mixture search into a dominant computational bottleneck.
#### Differentiable Formulation.
Through a simple reparameterization, we recast the original bilevel problem into a mathematically equivalent, fully differentiable objective. The key idea is to replace stochastic sampling by mixture ratios with per-source, differentiable loss weights applied under uniform sampling, so that each source's contribution is controlled continuously via its weight, yielding the following formulation:

$$\min_{\alpha}\ \mathcal{L}_{\text{target}}\big(w^{*}(\alpha)\big)\quad\text{s.t.}\quad w^{*}(\alpha)=\arg\min_{w}\sum_{i=1}^{k}\alpha_i\,\mathcal{L}_{\text{train}}\big(D_i,w\big),\quad\sum_{i=1}^{k}\alpha_i=1,\quad\alpha_i\geq 0, \tag{2}$$

where $\mathcal{L}_{\text{train}}(D_i,w)$ denotes the model's training loss on source $D_i$, computed under *uniform source sampling* (each source selected with probability $1/k$). The inner loop finds the optimal model weights, $w^{*}(\alpha)$, by minimizing a weighted sum of the training losses from $k$ different data domains. The data mixture weight $\alpha_i$ serves as the weight for each domain's loss. The outer loop then aims to optimize these proportions $\alpha$ to minimize the model's loss on target tasks. This reparameterization is key: rather than treating mixture ratios as non-differentiable sampling probabilities, we reinterpret them as continuous coefficients that scale each source's loss. Consequently, the mixture weights $\bm{\alpha}=(\alpha_1,\ldots,\alpha_k)$ are fully differentiable and amenable to gradient-based optimization. Standard optimizers (e.g., SGD or Adam) can then jointly update the model parameters and the data weights, enabling efficient end-to-end training.
Proof of equivalence. Let $D=\bigcup_{i=1}^{k}D_i$ denote the union of $k$ data sources (or clusters), and let $\alpha=(\alpha_1,\dots,\alpha_k)$ be mixture weights with $\sum_i\alpha_i=1$, $\alpha_i\geq 0$. To sample a training example $x$, first draw a source index $i\sim\mathrm{Cat}(\alpha)$, then sample $x\sim D_i$. The training loss under this mixture sampling is

$$\mathcal{L}_{\text{train}}(D,w\mid\alpha)=\mathbb{E}_{i\sim\mathrm{Cat}(\alpha)}\,\mathbb{E}_{x\sim D_i}\big[\ell(x,w)\big]=\sum_{i=1}^{k}\alpha_i\,\mathcal{L}_{\text{train}}(D_i,w), \tag{3}$$

where $\ell(x,w)$ is the per-example loss and $\mathcal{L}_{\text{train}}(D_i,w)=\mathbb{E}_{x\sim D_i}[\ell(x,w)]$ is the expected loss on source $D_i$. Thus, under mixture sampling, the expected training loss is a convex combination of the per-source losses, with coefficients given by the mixture ratios.
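The identity in Eq. (3) can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the FastMix code: the scalar model, the squared-error loss, the three tiny sources, and all function names are assumptions chosen to keep the check self-contained. It compares the exact weighted sum of per-source losses with a Monte Carlo estimate obtained by first drawing a source from $\mathrm{Cat}(\alpha)$ and then an example from that source.

```python
import random


def per_source_loss(source, w):
    """Mean squared error of a scalar model w on one data source."""
    return sum((w - x) ** 2 for x in source) / len(source)


def weighted_loss(sources, alpha, w):
    """Right-hand side of Eq. (3): alpha-weighted sum of per-source losses."""
    return sum(a * per_source_loss(s, w) for a, s in zip(alpha, sources))


def sampled_loss(sources, alpha, w, n_draws, rng):
    """Monte Carlo estimate of the loss under mixture sampling:
    draw a source index from Cat(alpha), then an example uniformly from it."""
    total = 0.0
    indices = range(len(sources))
    for _ in range(n_draws):
        i = rng.choices(indices, weights=alpha)[0]
        x = rng.choice(sources[i])
        total += (w - x) ** 2
    return total / n_draws


# Hypothetical toy data: three sources of scalar examples.
sources = [[0.0, 1.0], [4.0, 6.0], [10.0]]
alpha = [0.5, 0.3, 0.2]
w = 2.0

exact = weighted_loss(sources, alpha, w)
estimate = sampled_loss(sources, alpha, w, 100_000, random.Random(0))
print(f"weighted sum: {exact:.3f}, sampled estimate: {estimate:.3f}")
```

The two numbers agree up to sampling noise, which is the sense in which sampling by $\alpha$ and weighting by $\alpha$ define the same expected training loss.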
### 3.2 How to obtain better generalization performance?
Like most AutoML algorithms, FastMix requires a search target, typically defined as a performance metric on a held-out validation set. However, relying on validation performance alone can lead to overfitting to quirks of the validation data and limited transferability to new scenarios. To improve generalization, we propose two complementary strategies: (i) entropy-based regularization to encourage diversity among mixture weights, and (ii) incorporating training loss into the search target to balance validation and training signals.
Entropy-based regularization. Entropy regularization prevents the mixture distribution from collapsing onto a narrow subset of data sources. Given mixture weights $(\alpha_1,\dots,\alpha_k)$ across $k$ sources, we add the penalty $\mathcal{R}_{\text{entropy}}=\sum_{i=1}^{k}\alpha_i\log\alpha_i$, which is the negative Shannon entropy of $\alpha$. Minimizing this term therefore maximizes entropy: it discourages overly peaked distributions and promotes more uniform weight allocation. This reduces sensitivity to spurious validation patterns and improves robustness by leveraging multiple data sources.
Training loss as an auxiliary target. We further integrate the training loss into the search objective to complement the validation signal. While the validation term reflects out-of-sample generalization, the training term measures how effectively the model fits the mixture as a whole. Combining the two reduces over-reliance on the limited validation set and guides the search toward mixture ratios that generalize more reliably across both in-domain and out-of-domain data.
Joint objective. Together, entropy regularization and the auxiliary training loss yield the following search objective:

$$\mathcal{L}_{\text{target}}(w)=\ell_{\text{val}}(w)+\beta\,\mathcal{L}_{\text{train}}(w)+\lambda\sum_{i=1}^{k}\alpha_i\log\alpha_i, \tag{4}$$

where $\beta\geq 0$ and $\lambda\geq 0$ are trade-off hyperparameters. Empirically, $\lambda$ is set to a small value (e.g., $10^{-5}$) to encourage diversity without dominating the optimization, while $\beta$ is most effective at moderate values (e.g., $0.1$). We provide a detailed sensitivity analysis of these hyperparameters in our ablation studies. Overall, these two strategies substantially improve the generalization ability of FastMix, enabling it to discover mixtures that not only perform strongly on validation benchmarks but also transfer robustly to broader real-world applications.
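As a rough illustration of how the three terms of Eq. (4) combine, the following sketch evaluates the objective for given scalar losses. The function names and the example loss values are hypothetical; only the form of the objective and the example settings $\beta=0.1$ and $\lambda=10^{-5}$ come from the text above.

```python
import math


def neg_entropy(alpha):
    """Entropy penalty sum_i alpha_i * log(alpha_i), with 0 * log(0) taken as 0."""
    return sum(a * math.log(a) for a in alpha if a > 0.0)


def search_objective(val_loss, train_loss, alpha, beta=0.1, lam=1e-5):
    """Eq. (4): validation loss + beta * training loss + lambda * entropy penalty."""
    return val_loss + beta * train_loss + lam * neg_entropy(alpha)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [1.0, 0.0, 0.0, 0.0]

# Made-up loss values, identical for both mixtures, to isolate the penalty.
print(search_objective(2.0, 3.0, uniform))
print(search_objective(2.0, 3.0, peaked))
```

With identical loss values, the uniform mixture receives the lower objective because its penalty is $-\log 4$ while the fully peaked mixture's penalty is $0$; with $\lambda=10^{-5}$ the difference is tiny, consistent with the remark that the term should not dominate the optimization.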
### 3.3 Optimization
Although the reparameterized formulation enables end-to-end differentiation over both model parameters and data mixtures, the resulting bilevel problem is still difficult to solve directly. Accordingly, we adopt an iterative procedure (Alg. [1](https://arxiv.org/html/2606.14971#alg1)) that alternates between updating the model parameters and the mixture weights (Maclaurin et al., [2015](https://arxiv.org/html/2606.14971#bib.bib74); Liu et al., [2018](https://arxiv.org/html/2606.14971#bib.bib73); Pedregosa, [2016](https://arxiv.org/html/2606.14971#bib.bib75); Franceschi et al., [2018](https://arxiv.org/html/2606.14971#bib.bib76)). The two key steps are outlined below.
(i) Inner loop (network parameter update). Given current mixture weights $\alpha^{t}$, the model parameters $w$ are updated for $n_1$ steps via stochastic gradient descent (SGD) to minimize the weighted training loss $\mathcal{L}_{\text{train}}$:

$$w^{t+1}\leftarrow w^{t}-\eta_w^{t}\,\frac{\partial\Big(\sum_{i=1}^{k}\alpha_i^{t}\,\mathcal{L}_{\text{train}}(D_i,w^{t})\Big)}{\partial w^{t}}, \tag{5}$$

where $\mathcal{L}_{\text{train}}(D_i,w)$ denotes the model's training loss on source $D_i$, computed under *uniform source sampling* (each source selected with probability $1/k$). This is repeated for $n_1$ iterations. Other gradient-based optimizers, such as Adam (Kingma and Ba, [2014](https://arxiv.org/html/2606.14971#bib.bib51)), are compatible with our framework. After $n_1$ updates, we denote the resulting parameters as $w^{t+n_1}$.
(ii) Outer loop (mixture weight update). The mixture weights $\alpha^{t}$ are then updated using validation feedback $\mathcal{L}_{\text{target}}$. Specifically, the model is trained for $n_2$ iterations with the previous mixture weights $\alpha^{t}$, and the resulting parameters $w^{t+n_2}$ are evaluated on the validation loss $\mathcal{L}_{\text{target}}$. The mixture weights are updated as:

$$\alpha^{t+1}\leftarrow\alpha^{t}-\eta_{\alpha}^{t}\,\frac{\partial\mathcal{L}_{\text{target}}\big(w^{t+n_2}\big)}{\partial\alpha^{t}}. \tag{6}$$

In effect, $\alpha^{t+1}$ is updated according to how the validation loss responds after $n_2$ steps of training under $\alpha^{t}$. This naturally assigns larger weights to data sources that contribute more to improving validation performance. A key consideration is how the gradient is estimated, since this directly impacts both the direction of updates and the efficiency of the search.
In the special case $n_2=1$ with SGD updates, the gradient of the validation loss with respect to $\alpha_i^{t}$ has a closed form:

$$\frac{\partial\mathcal{L}_{\text{target}}\big(w^{t+1}\big)}{\partial\alpha_i^{t}}=\frac{\partial\mathcal{L}_{\text{target}}(w^{t+1})}{\partial w^{t+1}}\cdot\frac{\partial w^{t+1}}{\partial\alpha_i^{t}}=-\eta_w^{t}\,\nabla_w\ell_{\text{val}}(V,w^{t+1})\cdot\nabla_w\mathcal{L}_{\text{train}}(D_i,w^{t}), \tag{7}$$

where $D_i$ denotes the $i$-th training source and $\mathcal{L}_{\text{target}}=\ell_{\text{val}}$ as in Eq. (1). This shows that per-source training gradients directly shape the mixture gradients. The following derivation shows why the formula holds. Under the SGD update rule, the weights at step $t+1$ are obtained by a gradient step, taken with respect to $w$, on the training loss weighted by the mixture coefficients $\alpha_i^{t}$: $w^{t+1}=w^{t}-\eta_w^{t}\nabla_w\big[\sum_{i=1}^{k}\alpha_i^{t}\,\mathcal{L}_{\text{train}}(D_i,w^{t})\big]$. Taking the derivative of $w^{t+1}$ with respect to $\alpha_i^{t}$, we get $\frac{\partial w^{t+1}}{\partial\alpha_i^{t}}=\frac{\partial}{\partial\alpha_i^{t}}\Big[w^{t}-\eta_w^{t}\nabla_w\Big(\sum_{j=1}^{k}\alpha_j^{t}\,\mathcal{L}_{\text{train}}(D_j,w^{t})\Big)\Big]$. Since $w^{t}$ is independent of $\alpha_i^{t}$, the derivative of the first term is zero. By linearity of the derivative and the sum, only the term with $j=i$ remains; hence $\frac{\partial w^{t+1}}{\partial\alpha_i^{t}}=-\eta_w^{t}\nabla_w\mathcal{L}_{\text{train}}(D_i,w^{t})$.
The formulation in Eq. ([7](https://arxiv.org/html/2606.14971#S3.E7)) can be intuitively understood as follows. The gradient with respect to $\alpha_i$ is proportional to the *alignment* between (i) the validation gradient $\nabla_w\ell_{\text{val}}(V,w^{t+1})$ and (ii) the training gradient from source $D_i$, $\nabla_w\mathcal{L}_{\text{train}}(D_i,w^{t})$. If these gradients are aligned (positive dot product), the derivative $-\eta_w^{t}\,\nabla_w\ell_{\text{val}}\cdot\nabla_w\mathcal{L}_{\text{train}}(D_i,w^{t})$ is negative, so a gradient-descent step on $\alpha_i$ *increases* its weight, emphasizing sources whose updates also reduce the validation loss. If they are opposed (negative dot product), the derivative is positive and a step *decreases* $\alpha_i$, down-weighting sources that harm validation performance. Near-orthogonality yields small updates. Thus, the procedure reallocates mass toward data sources whose training signals most effectively improve the validation objective.
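The closed form in Eq. (7) can be sanity-checked against finite differences on a small problem. The sketch below is an illustration under stated assumptions rather than the authors' implementation: each source loss is taken to be the quadratic $\tfrac12\lVert w-c_i\rVert^2$ with a hypothetical center $c_i$, the validation loss is $\tfrac12\lVert w-v\rVert^2$, and $\alpha$ is treated as a free coefficient vector when it is perturbed. All function names are invented for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_grad(w, center):
    """Gradient of the toy per-source loss 0.5 * ||w - center||^2."""
    return [wi - ci for wi, ci in zip(w, center)]


def sgd_step(w, alpha, centers, eta):
    """One SGD step on the alpha-weighted training loss, as in Eq. (5)."""
    grads = [train_grad(w, c) for c in centers]
    return [wi - eta * sum(a * g[d] for a, g in zip(alpha, grads))
            for d, wi in enumerate(w)]


def val_loss(w, target):
    """Toy validation loss 0.5 * ||w - target||^2."""
    return 0.5 * sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def alpha_grad(w, alpha, centers, target, eta):
    """Closed form of Eq. (7): -eta * <grad_val(w_next), grad_train_i(w)>."""
    w_next = sgd_step(w, alpha, centers, eta)
    val_grad = [wi - ti for wi, ti in zip(w_next, target)]
    return [-eta * dot(val_grad, train_grad(w, c)) for c in centers]


def finite_difference(w, alpha, centers, target, eta, i, h=1e-5):
    """Central difference of the post-step validation loss w.r.t. alpha_i."""
    up, down = list(alpha), list(alpha)
    up[i] += h
    down[i] -= h
    return (val_loss(sgd_step(w, up, centers, eta), target)
            - val_loss(sgd_step(w, down, centers, eta), target)) / (2 * h)


centers = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]  # hypothetical sources
target = [1.0, 0.0]                               # validation optimum
w0, alpha0, eta = [0.0, 0.0], [1 / 3, 1 / 3, 1 / 3], 0.1

g = alpha_grad(w0, alpha0, centers, target, eta)
fd = [finite_difference(w0, alpha0, centers, target, eta, i) for i in range(3)]
print(g, fd)
```

In this setup the first source pulls $w$ toward the validation optimum, so its gradient entry is negative and a descent step raises its weight; the second source pulls the opposite way and gets a positive entry; the third is nearly orthogonal and gets an entry close to zero.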
When $n_2>1$, deriving a closed-form gradient becomes intractable, requiring finite-difference approximations or similar techniques, which are often unstable and inefficient. In contrast, $n_2=1$ admits a closed-form gradient that is both computationally efficient and empirically effective.
Algorithm 1: FastMix Optimization Algorithm

1. Initialize model parameters $w^{0}$, mixture weights $\alpha^{0}$, inner-loop duration $n_1$, and outer-loop duration $n_2$.
2. for $t=0,1,\dots,T-1$ do
3. if $t \bmod n_1\neq 0$ then
4. // Inner loop: update model parameters (shown with SGD; the update rule can be swapped for other optimizers such as Adam (Kingma and Ba, [2014](https://arxiv.org/html/2606.14971#bib.bib51)))
5. $w^{t+1}\leftarrow w^{t}-\eta_w^{t}\,\frac{\partial\big[\sum_{i=1}^{k}\alpha_i^{t}\,\mathcal{L}_{\text{train}}(D_i,w^{t})\big]}{\partial w^{t}}$
6. else
7. // Outer loop: update mixture weights (shown with SGD; the update rule can be swapped for other optimizers such as Adam (Kingma and Ba, [2014](https://arxiv.org/html/2606.14971#bib.bib51)))
8. $\alpha^{t+1}\leftarrow\alpha^{t}-\eta_{\alpha}^{t}\,\frac{\partial\mathcal{L}_{\text{target}}\big(w^{t+n_2}\big)}{\partial\alpha^{t}}$
9. end if
10. end for
11. Output: the optimized mixture weights $\alpha^{\text{final}}$ after the final outer-loop update.
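The alternating schedule of Algorithm 1 can be sketched on a toy problem as follows. This is a minimal illustration, not the released implementation, and it rests on several assumptions that the text does not specify: the per-source and validation losses are toy quadratics, the simplex constraint is enforced by parameterizing $\alpha$ as a softmax over logits, the outer step uses the $n_2=1$ closed form of Eq. (7) with a look-ahead SGD step that is not committed to $w$, and the regularizers of Eq. (4) are omitted ($\beta=\lambda=0$). The function `fastmix_toy` and its arguments are hypothetical names.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train_grad(w, center):
    """Gradient of the toy per-source loss 0.5 * ||w - center||^2."""
    return [wi - ci for wi, ci in zip(w, center)]


def sgd_step(w, alpha, centers, eta_w):
    """Inner-loop update: one SGD step on the alpha-weighted training loss."""
    grads = [train_grad(w, c) for c in centers]
    return [wi - eta_w * sum(a * g[d] for a, g in zip(alpha, grads))
            for d, wi in enumerate(w)]


def fastmix_toy(centers, target, steps=300, n1=5, eta_w=0.1, eta_alpha=1.0):
    w = [0.0] * len(target)
    logits = [0.0] * len(centers)  # assumption: alpha = softmax(logits)
    for t in range(steps):
        alpha = softmax(logits)
        if t % n1 != 0:
            # Inner loop: update model parameters under the current mixture.
            w = sgd_step(w, alpha, centers, eta_w)
        else:
            # Outer loop (n2 = 1): closed-form gradient of the toy validation
            # loss 0.5 * ||w_next - target||^2 with respect to each alpha_i.
            w_next = sgd_step(w, alpha, centers, eta_w)
            val_grad = [wi - ti for wi, ti in zip(w_next, target)]
            g = [-eta_w * sum(v * q for v, q in zip(val_grad, train_grad(w, c)))
                 for c in centers]
            # Chain rule through the softmax, then a descent step on the logits.
            g_bar = sum(a * gi for a, gi in zip(alpha, g))
            logits = [z - eta_alpha * a * (gi - g_bar)
                      for z, a, gi in zip(logits, alpha, g)]
    return softmax(logits), w


centers = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]  # hypothetical sources
target = [1.0, 0.0]                               # validation optimum = source 0
alpha, w = fastmix_toy(centers, target)
print(alpha, w)
```

Because the validation optimum coincides with the first source's center, the learned mixture shifts weight toward that source while remaining a valid probability vector. A clip-and-renormalize projection would be another way to keep $\alpha$ on the simplex; the softmax is used here only because it keeps the update unconstrained.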
## 4 Experiments
To comprehensively evaluate the effectiveness of our proposed framework, we conduct experiments on data mixture optimization across different stages of large language model (LLM) training, including both pre-training and post-training. The compared methods cover a wide spectrum of approaches, ranging from human expert tuning to proxy-based search methods such as DoReMi (Xie et al., [2024](https://arxiv.org/html/2606.14971#bib.bib6)), RegMix (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)), and CLIMB (Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)), and dynamic methods, including ODM (Albalak et al., [2023](https://arxiv.org/html/2606.14971#bib.bib66)) and IDEAL (Ming et al., [2025](https://arxiv.org/html/2606.14971#bib.bib71)). The subsequent sections are organized as follows: Section [4.1](https://arxiv.org/html/2606.14971#S4.SS1) presents results on pre-training mixture optimization, and Section [4.2](https://arxiv.org/html/2606.14971#S4.SS2) reports experiments in post-training settings.
### 4.1 Pre-training Stage Experiments
Setups. Following prior work (Liu et al., [2024](https://arxiv.org/html/2606.14971#bib.bib10)), we conduct our experiments on the Pile dataset (Gao et al., [2020](https://arxiv.org/html/2606.14971#bib.bib60)), focusing on the 17 uncopyrighted subsets available on HuggingFace. For mixture optimization in the pre-training stage, we employ small proxy models (e.g., 1M parameters) trained on up to 1B tokens. To test the method's generalization ability, consistent with Liu et al. ([2024](https://arxiv.org/html/2606.14971#bib.bib10)), we use the loss on a representative and diverse part of the training data (the Pile-CC subset (Gao et al., [2020](https://arxiv.org/html/2606.14971#bib.bib60))) as the search target. FastMix employs only a single proxy model, whereas RegMix uses 512 proxy models, following Liu et al. ([2024](https://arxiv.org/html/2606.14971#bib.bib10)), and CLIMB uses 64 (Diao et al., [2025](https://arxiv.org/html/2606.14971#bib.bib52)). For the Human Heuristic baseline, we directly adopt the manually tuned mixture configuration reported in Liu et al. ([2024](https://arxiv.org/html/2606.14971#bib.bib10)) to ensure fairness. After the search stage, we use the mixture configurations obtained by each method to train a 1B-parameter model on 25B tokens. For evaluation, we focus on the accuracy of the pretrained model on a suite of downstream task benchmarks, including Social IQA (Sap et al., [2019](https://arxiv.org/html/2606.14971#bib.bib53)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.14971#bib.bib54)), and PiQA (Bisk et al., [2020](https://arxiv.org/html/2606.14971#bib.bib55)), among others. In addition, we examine the time cost incurred by each method during the search stage.
Figure 2: Comparative evaluation of different data mixture strategies in the context of large-scale pretraining, examining their impact on both downstream task performance and training efficiency.

Results. As shown in Figure [2](https://arxiv.org/html/2606.14971#S4.F2), our proposed method, FastMix, demonstrates significant advantages in both downstream task performance and computational efficiency compared to existing data mixture strategies. It achieves the highest average performance score of 48.2 and the best average rank of 1 across all 14 downstream benchmarks, outperforming strong baselines including CLIMB (47.5) and RegMix (47.2). This top ranking underscores its consistent and robust generalization capabilities, further evidenced by its leading results on 9 of the 14 individual tasks. Most notably, FastMix offers a dramatic improvement in search efficiency, requiring only 1.3 GPU-hours to identify the optimal mixture. This is orders of magnitude faster than other automated methods, such as CLIMB (71.9 GPU-hours) and RegMix (720.5 GPU-hours), validating the efficacy of our single proxy model and gradient-based optimization approach. Collectively, these results confirm that FastMix not only discovers superior data mixture configurations but also drastically reduces the computational overhead of the search process, offering a scalable and practical solution for large-scale model training.
### 4.2 Post-training Stage Experiments
Setups. Building on the pre-training results, we next validate FastMix in the post-training stage, aiming to optimize data mixtures for specialized tasks on the Qwen2.5-Math-Instruct 7B model (Hui et al., [2024](https://arxiv.org/html/2606.14971#bib.bib91)). For this study, we source supervised fine-tuning (SFT) data from eight distinct domains, including Math (OpenR1-Math-220k (Open-R1 Team, [2024](https://arxiv.org/html/2606.14971#bib.bib77))), Code (the programming-related subset of OpenThoughts-114K (Guha et al., [2025](https://arxiv.org/html/2606.14971#bib.bib78))), Dialogue (ShareGPT (RyokoAI, [2023](https://arxiv.org/html/2606.14971#bib.bib79))), and STEM (Platypus (Lee et al., [2023](https://arxiv.org/html/2606.14971#bib.bib80))). The search objective is a 1:1 weighted sum of scores from two mathematical benchmarks, the simpler GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.14971#bib.bib85)) and the more challenging gaokao2023en (MARIO-Math-Reasoning, [2023](https://arxiv.org/html/2606.14971#bib.bib84)). To evaluate the model's generalization capabilities, we extend the test suite beyond math (MATH (Hendrycks et al., [2021](https://arxiv.org/html/2606.14971#bib.bib81)), AIME-24 (Jia, [2024](https://arxiv.org/html/2606.14971#bib.bib82))) to include tasks in coding (LiveCodeBench-v2 (Jiang et al., [2024](https://arxiv.org/html/2606.14971#bib.bib83))) and STEM question-answering (GPQA-Diamond (Rein et al., [2023](https://arxiv.org/html/2606.14971#bib.bib86))). A significant challenge in the post-training setting is the absence of very small (e.g., 10M-parameter) proxy models. Therefore, we conduct the search using billion-parameter-scale proxy models (Qwen2.5-1.5B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2606.14971#bib.bib92))), with evaluation performed on larger models (7B).
This constraint exposed a critical limitation of resource\-intensive methods\(Liuet al\.,[2024](https://arxiv.org/html/2606.14971#bib.bib10); Diaoet al\.,[2025](https://arxiv.org/html/2606.14971#bib.bib52)\), which require training hundreds of proxy models\. Given the immense computational cost, our cluster was unable to support hundreds of full 1B\-model training runs, so we had to reduce the number of proxy models for both RegMix and CLIMB to just 64\. In contrast,FastMix’s reliance on a single proxy model enabled it to operate efficiently within these resource limitations, highlighting its superior scalability for larger\-scale tasks\.
Results\. In the post\-training \(SFT\) stage, the advantage of FastMix is even more pronounced, as shown in Figure[3](https://arxiv.org/html/2606.14971#S4.F3)\. Our method achieved the highest score on all four benchmarks spanning mathematics, coding, and STEM question\-answering, giving an average performance of 65\.4 and a top rank of 1, a 5\.5\-point lead over the next best method, CLIMB \(59\.9\) \(Diao et al\.,[2025](https://arxiv.org/html/2606.14971#bib.bib52)\)\. These results also highlight the generalization capability of FastMix\. Although all automated methods used performance on mathematics benchmarks \(GSM8K and gaokao2023en\) as the guidance signal for optimization, FastMix not only excelled in the math domain but also achieved the best performance on LiveCodeBench \(coding\) and GPQA\-Diamond \(STEM QA\)\. This suggests that the mixture identified by FastMix avoids overfitting to the optimization signal and instead yields a broader improvement in the model’s capabilities\. The search itself completed in just 2\.2 GPU\-hours, substantially faster than RegMix \(115\.9 GPU\-hours\) and CLIMB \(117\.4 GPU\-hours\)\.
Figure 3: Comparative evaluation of different data mixture strategies in large\-scale post\-training \(SFT\), examining search efficiency and downstream task performance\.
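The search costs quoted in the two Results paragraphs can be turned into explicit speedup factors\. The snippet below is only a back\-of\-the\-envelope check on the GPU\-hour figures reported in the text; the dictionary layout and the helper name `speedup_over` are illustrative choices, not part of FastMix\.

```python
# GPU-hours copied from the Results paragraphs of Sections 4.1 and 4.2.
SEARCH_COST_GPU_HOURS = {
    "pre-training": {"FastMix": 1.3, "CLIMB": 71.9, "RegMix": 720.5},
    "post-training": {"FastMix": 2.2, "RegMix": 115.9, "CLIMB": 117.4},
}


def speedup_over(costs, reference="FastMix"):
    """Return how many times more GPU-hours each method used than `reference`."""
    base = costs[reference]
    return {name: cost / base for name, cost in costs.items() if name != reference}


for stage, costs in SEARCH_COST_GPU_HOURS.items():
    for name, factor in speedup_over(costs).items():
        print(f"{stage}: {name} used {factor:.1f}x the search cost of FastMix")
```

With these inputs the ratios come out to roughly 55× \(CLIMB\) and 554× \(RegMix\) in pre\-training, and roughly 53× for both baselines in post\-training, where RegMix and CLIMB were already limited to 64 proxy models\.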
### 4\.3Tip: The painful lesson of no free lunch
In this subsection, we discuss several practical caveats\. Some of the conclusions come from experience in industrial development and may differ considerably from the simple, clean conclusions obtained on academic datasets\.
Non\-differentiable targets\. Our optimization algorithm is designed for settings where both $\mathcal{L}_{\text{target}}$ and $\mathcal{L}_{\text{train}}$ are differentiable\. In practice, however, non\-differentiable situations may arise\. A common case is a non\-differentiable objective, such as validation performance measured by a discrete metric \(e\.g\., accuracy\) rather than a smooth loss\. In such cases, we propose using a differentiable proxy objective, for instance the supervised fine\-tuning \(SFT\) loss for question\-answering tasks, which provides a smooth surrogate while remaining aligned with the discrete evaluation metric\. In our experience, this approach has been highly effective in practice\.
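As a minimal sketch of such a surrogate, the function below computes an SFT\-style loss as the mean negative log\-likelihood of the reference answer tokens\. Masking out prompt tokens is a common convention that we assume here; the text does not specify it, and the names `sft_loss`, `token_log_probs`, and `answer_mask` are hypothetical\.

```python
import math


def sft_loss(token_log_probs, answer_mask):
    """Mean negative log-likelihood over answer tokens only.

    token_log_probs: log-probability the model assigns to each reference token.
    answer_mask: 1 for answer tokens, 0 for prompt tokens (assumed convention).
    """
    selected = [-lp for lp, keep in zip(token_log_probs, answer_mask) if keep]
    if not selected:
        raise ValueError("answer_mask selects no tokens")
    return sum(selected) / len(selected)


# Two hypothetical models that both rank every reference token first
# (probability above 0.5), so a greedy exact-match metric cannot separate them.
mask = [0, 0, 1, 1]
model_a = [math.log(p) for p in (0.90, 0.80, 0.60, 0.55)]
model_b = [math.log(p) for p in (0.90, 0.80, 0.95, 0.90)]

print("model A SFT loss:", sft_loss(model_a, mask))
print("model B SFT loss:", sft_loss(model_b, mask))
```

The discrete metric is identical for the two toy models, while the surrogate still ranks model B above model A, which is the kind of graded signal a gradient\-based search needs\.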
Black\-box gradient estimators\. Our extensive experiments indicate that it is highly challenging to estimate gradients of non\-differentiable metrics with methods such as finite differences or Simultaneous Perturbation Stochastic Approximation \(SPSA\)\. Convergence is rarely achieved, particularly on industrial datasets\. We attribute this difficulty to two primary reasons\. First, SPSA relies heavily on hyperparameter tuning for gradient estimation, and its estimates are inherently inaccurate\. Second, finite differences depend on small perturbations of the parameters, whereas non\-differentiable metrics often require substantial perturbations to show even marginal changes, which makes the gradient estimates extremely noisy\. Furthermore, the finite\-difference method must perturb each source individually, which is highly inefficient and does not scale to a large number of sources\. Consequently, we suggest exercising extreme caution when considering black\-box metrics as optimization objectives for FastMix\.
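The second failure mode can be reproduced on a toy problem\. In the sketch below, a single scalar `w` stands in for a quantity being tuned, and the four\-example dataset `DATA` is invented purely for illustration; none of this comes from the paper's experiments\. Accuracy is piecewise constant in `w`, so a small central difference returns exactly zero, whereas the smooth log loss gives a usable slope\.

```python
import math

# Invented toy data: (feature, label) pairs for a one-parameter classifier.
DATA = [(1.0, 1), (2.0, 1), (-1.0, 0), (-0.5, 1)]


def accuracy(w, data):
    """Fraction of examples where sign(w * x) matches the label."""
    return sum((w * x > 0) == (y == 1) for x, y in data) / len(data)


def log_loss(w, data):
    """Mean binary cross-entropy of a logistic model with logit w * x."""
    total = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-w * x))
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(data)


def central_difference(f, w, eps):
    """Two-sided finite-difference estimate of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)


w0, eps = 0.3, 1e-3
print("d accuracy / dw:", central_difference(lambda w: accuracy(w, DATA), w0, eps))
print("d log loss / dw:", central_difference(lambda w: log_loss(w, DATA), w0, eps))
```

Each finite\-difference estimate also costs two metric evaluations per perturbed coordinate, so with one coordinate per data source the cost grows linearly in the number of sources, in line with the scaling concern raised above\.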
Long outer\-loop horizons\. Another challenge arises when the outer\-loop duration parameter $n_2$ is greater than one\. In this case, computing the gradient with respect to the mixture weights becomes intractable\. Without a constraint on $n_2$, one would either have to rely on automatic differentiation in PyTorch \(Paszke et al\.,[2019](https://arxiv.org/html/2606.14971#bib.bib50)\) to backpropagate through the unrolled updates, in the manner of backpropagation\-through\-time \(BPTT\), which quickly becomes prohibitively memory\-intensive in large\-model settings, or fall back on general gradient\-estimation techniques such as finite differences, which again are slow and unstable\. To avoid these pitfalls, we restrict $n_2 = 1$ whenever possible, which not only yields a closed\-form gradient but also delivers the most stable and efficient optimization behavior\.
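To illustrate why a one\-step horizon admits a closed form, the sketch below assumes that $n_2 = 1$ amounts to differentiating through a single inner update, and it uses a plain SGD step on toy quadratic losses\. Under that assumption, with $\theta^{+} = \theta - \eta \sum_k w_k \nabla \mathcal{L}_k(\theta)$, the chain rule gives $\partial \mathcal{L}_{\text{target}}(\theta^{+}) / \partial w_k = -\eta \, \nabla \mathcal{L}_{\text{target}}(\theta^{+})^{\top} \nabla \mathcal{L}_k(\theta)$\. The symbols $\theta$, $\eta$, $w_k$ and all function names are our notation for this example; the paper's algorithm cites Adam for its updates, so its exact expression may differ\.

```python
def loss_quadratic(theta, center):
    """Toy loss 0.5 * ||theta - center||^2."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, center))


def grad_quadratic(theta, center):
    """Gradient of the toy loss with respect to theta."""
    return [t - c for t, c in zip(theta, center)]


def inner_step(theta, weights, centers, lr):
    """One SGD step on the weighted sum of per-source losses."""
    new_theta = list(theta)
    for w, center in zip(weights, centers):
        g = grad_quadratic(theta, center)
        new_theta = [p - lr * w * gi for p, gi in zip(new_theta, g)]
    return new_theta


def one_step_hypergradient(theta, weights, centers, target, lr):
    """Closed form: -lr * <grad L_target(theta_plus), grad L_k(theta)> per source k."""
    theta_plus = inner_step(theta, weights, centers, lr)
    g_target = grad_quadratic(theta_plus, target)
    return [
        -lr * sum(a * b for a, b in zip(g_target, grad_quadratic(theta, center)))
        for center in centers
    ]


theta = [0.0, 0.0]
centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one toy "source" per center
target = [1.0, 0.5]
weights = [1.0 / 3.0] * 3

print(one_step_hypergradient(theta, weights, centers, target, lr=0.1))
```

In this toy setup the source whose gradient aligns with the target gradient receives a negative hypergradient, so a descent step on the weights raises its share, while the source pointing away from the target is down\-weighted\. No computation graph over multiple inner steps has to be stored, which is the memory saving the one\-step restriction buys\.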
About the regularization terms\. On simple and clean academic datasets, adding a regularization term is a straightforward way to prevent the optimization from collapsing onto just one or a few sources, which is a common issue in most current data\-mixing algorithms\. However, our extensive development experience with industrial data indicates that regularization terms may not be particularly effective\. Instead, the most robust solution is to enforce strict oversampling\-ratio constraints across all sources \(for instance, capping up\-sampling at three times the original size\)\.
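One possible reading of such a cap, sketched below, is that a source with `size` tokens may contribute at most `max_upsample * size` tokens to a training budget of `budget` tokens, so its mixture weight is bounded by `max_upsample * size / budget`\. The clip\-and\-rescale procedure and both function names are our assumptions for illustration; the text does not say how the constraint is enforced inside FastMix\.

```python
def upsampling_caps(source_sizes, budget, max_upsample=3.0):
    """Largest mixture weight each source may take under the up-sampling cap."""
    return [min(1.0, max_upsample * size / budget) for size in source_sizes]


def cap_mixture(weights, caps, tol=1e-12):
    """Clip weights at their caps and rescale the uncapped ones to sum to 1."""
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be strictly positive")
    if sum(caps) < 1.0 - tol:
        raise ValueError("caps are infeasible: they sum to less than 1")
    total = sum(weights)
    w = [x / total for x in weights]
    capped = [False] * len(w)
    while True:
        over = [i for i, x in enumerate(w) if not capped[i] and x > caps[i] + tol]
        if not over:
            return w
        for i in over:
            w[i] = caps[i]
            capped[i] = True
        free = [i for i in range(len(w)) if not capped[i]]
        if not free:
            return w
        # Hand the mass removed by clipping to the uncapped sources, pro rata.
        remaining = 1.0 - sum(w[i] for i in range(len(w)) if capped[i])
        scale = remaining / sum(w[i] for i in free)
        for i in free:
            w[i] *= scale


# Hypothetical source sizes and budget, in the same (arbitrary) token unit.
caps = upsampling_caps(source_sizes=[10, 30, 60], budget=100)
print(cap_mixture([0.6, 0.3, 0.1], caps))
```

In the example, the smallest source is requested at 60% of the budget but can supply at most 30% under a 3× cap, so the excess is redistributed to the two larger sources in proportion to their proposed weights\.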
About the small proxy model\. In industrial scenarios, caution should be exercised when relying on small surrogate models \(smaller than 0\.5B parameters\) to determine hyperparameters such as data\-mixing ratios\. Based on our extensive experimentation with industrial data, small surrogate models exhibit significant limitations\. First, they suffer from convergence instability, which often yields highly noisy mixing ratios; this issue appears inherently tied to model scale rather than to the algorithm itself, as we observed the same phenomenon even when using RegMix as an oracle\. Second, discrepancies in model capacity and architecture naturally lead to distinct biases toward different data sources\.
About the search target data\. In this study, we adhere to the experimental setup of RegMix and use the loss on the Pile\-cc validation set as our optimization target\. In industrial development, however, practitioners typically maintain proprietary validation sets distinct from the test set\. As suggested above, open\-ended questions within these sets can be converted into SFT data in order to compute an SFT loss\. Crucially, we identify a major bottleneck when doing so in pre\-training: pre\-training sequences are typically long, whereas SFT data is significantly shorter\. This structural discrepancy causes the gradients computed on the two data types to diverge drastically, ultimately leading to the failure of FastMix\. To mitigate this issue, a straightforward yet highly effective solution is to concatenate multiple SFT sequences so that their lengths align with the pre\-training data\.
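A minimal sketch of this concatenation step is given below\. It greedily packs tokenized SFT examples into sequences of at most `target_len` tokens, separated by an end\-of\-sequence token\. The greedy strategy, the truncation of over\-long examples, and the names `pack_sequences` and `eos_id` are assumptions made for illustration; the text only states that multiple SFT sequences are concatenated\.

```python
def pack_sequences(sequences, target_len, eos_id):
    """Greedily concatenate token-id lists into chunks of at most target_len tokens."""
    packed, current = [], []
    for seq in sequences:
        # Append a separator, and truncate examples that alone exceed target_len.
        piece = (list(seq) + [eos_id])[:target_len]
        if current and len(current) + len(piece) > target_len:
            packed.append(current)
            current = []
        current = current + piece
    if current:
        packed.append(current)
    return packed


# Hypothetical token ids; 0 plays the role of the end-of-sequence token.
sft_examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
for chunk in pack_sequences(sft_examples, target_len=8, eos_id=0):
    print(len(chunk), chunk)
```

In practice `target_len` would be set to the pre\-training context length, so that target\-side gradients are computed on sequences of comparable length to the training\-side ones\.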
## 5Conclusion
We introduced FastMix, an efficient framework for discovering data mixtures for large\-model training\. Our key contribution is a weighted bilevel reformulation of mixture selection: via a reparameterization, optimizing sampling ratios becomes equivalent to learning per\-source loss weights, which makes the mixture coefficients differentiable\. This permits joint, gradient\-based optimization of both the model and the mixture using a single proxy model rather than hundreds\. Across pre\-training and post\-training, FastMix delivers superior accuracy with orders\-of\-magnitude lower search cost, making data mixture optimization practical, scalable, and robust for next\-generation LLMs\.
## 6Future Works
FastMix also has limitations that leave room for future exploration\. First, its current one\-step, short\-horizon outer\-loop update introduces a degree of greediness, making the algorithm somewhat sensitive to data noise\. Second, we observed intriguing search dynamics during optimization: many data sources exhibit a competitive, time\-evolving relationship\. Certain sources prove vital in the early stages, whereas others become the most critical only after prolonged training\. This phenomenon offers valuable insights into data curriculum design for large\-scale model training\. Consequently, we believe FastMix can be extended beyond data mixing to serve as a framework for data source attribution\. We welcome community interest and invite collaboration and further discussion\.
## References
- A\. Albalak, L\. Pan, C\. Raffel, and W\. Y\. Wang \(2023\)Efficient online data mixing for language model pre\-training\.arXiv preprint arXiv:2312\.02406\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p3.1),[§4](https://arxiv.org/html/2606.14971#S4.p1.1)\.
- L\. B\. Allal, A\. Lozhkov, E\. Bakouch, L\. von Werra, and T\. Wolf \(2024\)SmolLM \- blazingly fast and remarkably powerful\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi,et al\.\(2020\)Piqa: reasoning about physical commonsense in natural language\.InProceedings of the AAAI conference on artificial intelligence,Vol\.34,pp\. 7432–7439\.Cited by:[§4\.1](https://arxiv.org/html/2606.14971#S4.SS1.p1.1)\.
- M\. F\. Chen, M\. Y\. Hu, N\. Lourie, K\. Cho, and C\. Ré \(2024\)Aioli: a unified optimization framework for language model data mixing\.arXiv preprint arXiv:2411\.05735\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p3.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- S\. Diao, Y\. Yang, Y\. Fu, X\. Dong, D\. Su, M\. Kliegl, Z\. Chen, P\. Belcak, Y\. Suhara, H\. Yin, M\. Patwary, C\. Lin, J\. Kautz, and P\. Molchanov \(2025\)CLIMB: clustering\-based iterative data mixture bootstrapping for language model pre\-training\.arXiv preprint\.External Links:[Link](https://arxiv.org/abs/2504.13161)Cited by:[Figure 1](https://arxiv.org/html/2606.14971#S0.F1),[§1](https://arxiv.org/html/2606.14971#S1.p2.1),[§1](https://arxiv.org/html/2606.14971#S1.p4.4),[§2](https://arxiv.org/html/2606.14971#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.14971#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1),[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p2.1),[§4](https://arxiv.org/html/2606.14971#S4.p1.1)\.
- G\. Dong, H\. Yuan, K\. Lu, C\. Li, M\. Xue, D\. Liu, W\. Wang, Z\. Yuan, C\. Zhou, and J\. Zhou \(2023\)How abilities in large language models are affected by supervised fine\-tuning data composition\.arXiv preprint arXiv:2310\.05492\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- S\. Fan, M\. Pagliardini, and M\. Jaggi \(2023\)Doge: domain reweighting with generalization estimation\.arXiv preprint arXiv:2310\.15393\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- L\. Franceschi, P\. Frasconi, S\. Salzo, and M\. Pontil \(2018\)Bilevel programming for hyperparameter optimization and meta\-learning\.InInternational Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p3.1),[§3\.3](https://arxiv.org/html/2606.14971#S3.SS3.p1.1)\.
- L\. Gao, S\. Biderman, S\. Black, L\. Golding, T\. Hoppe, C\. Foster, J\. Phang, H\. He, A\. Thite, N\. Nabeshima, S\. Presser, and C\. Leahy \(2020\)The Pile: an 800gb dataset of diverse text for language modeling\.arXiv preprint arXiv:2101\.00027\.Cited by:[§4\.1](https://arxiv.org/html/2606.14971#S4.SS1.p1.1)\.
- C\. Ge, Z\. Ma, D\. Chen, Y\. Li, and B\. Ding \(2024\)Data mixing made efficient: a bivariate scaling law for language model pretraining\.arXiv preprint arXiv:2405\.14908\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- E\. Guha, R\. Marten, S\. Keh, N\. Raoof, G\. Smyrnis, H\. Bansal, M\. Nezhurina, J\. Mercat, T\. Vu, Z\. Sprague, A\. Suvarna, B\. Feuer, L\. Chen, Z\. Khan, E\. Frankel, S\. Grover, C\. Choi, N\. Muennighoff, S\. Su, W\. Zhao, J\. Yang, S\. Pimpalgaonkar, K\. Sharma, C\. C\. Ji, Y\. Deng, S\. Pratt, V\. Ramanujan, J\. Saad\-Falcon, J\. Li, A\. Dave, A\. Albalak, K\. Arora, B\. Wulfe, C\. Hegde, G\. Durrett, S\. Oh, M\. Bansal, S\. Gabriel, A\. Grover, K\. Chang, V\. Shankar, A\. Gokaslan, M\. A\. Merrill, T\. Hashimoto, Y\. Choi, J\. Jitsev, R\. Heckel, M\. Sathiamoorthy, A\. G\. Dimakis, and L\. Schmidt \(2025\)OpenThoughts: data recipes for reasoning models\.External Links:2506\.04178,[Link](https://arxiv.org/abs/2506.04178)Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- S\. Gunasekar, Y\. Zhang, J\. Aneja, C\. C\. T\. Mendes, A\. Del Giorno, S\. Gopi, M\. Javaheripi, P\. Kauffmann, G\. de Rosa, O\. Saarikivi,et al\.\(2023\)Textbooks are all you need\.arXiv preprint arXiv:2306\.11644\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- Z\. He, T\. Liang, J\. Xu, Q\. Liu, X\. Chen, Y\. Wang, L\. Song, D\. Yu, Z\. Liang, W\. Wang,et al\.\(2025\)Deepmath\-103k: a large\-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning\.arXiv preprint arXiv:2504\.11456\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.arXiv preprint arXiv:2103\.03874\.Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- S\. Hu, Y\. Tu, X\. Han, C\. He, G\. Cui, X\. Long, Z\. Zheng, Y\. Fang, Y\. Huang, W\. Zhao,et al\.\(2024\)Minicpm: unveiling the potential of small language models with scalable training strategies\.arXiv preprint arXiv:2404\.06395\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p1.1)\.
- B\. Hui, B\. Yang, Z\. Cui, C\. Li, J\. Yang, D\. Liu, L\. Zhang, T\. Liu, J\. Zhang, B\. Yu,et al\.\(2024\)Qwen2\.5\-math technical report: toward mathematical expert model via self\-improvement\.arXiv preprint arXiv:2409\.12122\.Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- M\. Jia \(2024\)AIME\_2024\.Hugging Face\.Note:[https://huggingface\.co/datasets/Maxwell\-Jia/AIME\_2024](https://huggingface.co/datasets/Maxwell-Jia/AIME_2024)Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- C\. Jiang, S\. Dooley, C\. White, M\. Jin, Y\. Shen, D\. Shi, R\. Zheng, D\. Chen, Y\. Zhang, Y\. Li,et al\.\(2024\)LiveCodeBench: holistic and contamination free evaluation of large language models for code\.arXiv preprint arXiv:2403\.07974\.Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- F\. Kang, Y\. Sun, B\. Wen, S\. Chen, D\. Song, R\. Mahmood, and R\. Jia \(2024\)Autoscale: scale\-aware data mixing for pre\-training llms\.arXiv preprint arXiv:2407\.20177\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p2.1)\.
- D\. P\. Kingma and J\. Ba \(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p3.1),[§3\.3](https://arxiv.org/html/2606.14971#S3.SS3.p2.10),[4](https://arxiv.org/html/2606.14971#alg1.l4.1),[7](https://arxiv.org/html/2606.14971#alg1.l7.1)\.
- P\. W\. Koh and P\. Liang \(2017\)Understanding black\-box predictions via influence functions\.InInternational conference on machine learning,pp\. 1885–1894\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p3.1)\.
- A\. N\. Lee, C\. J\. Hunter, and N\. Ruiz \(2023\)Platypus: quick, cheap, and powerful refinement of llms\.arXiv preprint arXiv:2308\.07317\.Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- H\. Liu, K\. Simonyan, and Y\. Yang \(2018\)Darts: differentiable architecture search\.arXiv preprint arXiv:1806\.09055\.Cited by:[§3\.3](https://arxiv.org/html/2606.14971#S3.SS3.p1.1)\.
- Q\. Liu, X\. Zheng, N\. Muennighoff, G\. Zeng, L\. Dou, T\. Pang, J\. Jiang, and M\. Lin \(2024\)Regmix: data mixture as regression for language model pre\-training\.arXiv preprint arXiv:2407\.01492\.Cited by:[Figure 1](https://arxiv.org/html/2606.14971#S0.F1),[§1](https://arxiv.org/html/2606.14971#S1.p2.1),[§1](https://arxiv.org/html/2606.14971#S1.p4.4),[§2](https://arxiv.org/html/2606.14971#S2.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p2.1),[§4\.1](https://arxiv.org/html/2606.14971#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1),[§4](https://arxiv.org/html/2606.14971#S4.p1.1)\.
- D\. Maclaurin, D\. Duvenaud, and R\. Adams \(2015\)Gradient\-based hyperparameter optimization through reversible learning\.InInternational Conference on Machine Learning,pp\. 2113–2122\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p3.1),[§3\.3](https://arxiv.org/html/2606.14971#S3.SS3.p1.1)\.
- MARIO\-Math\-Reasoning \(2023\)Gaokao2023\-Math\-En: English Translation of Chinese Gaokao 2023 Mathematics Problems\.Hugging Face\.Note:[https://huggingface\.co/datasets/MARIO\-Math\-Reasoning/Gaokao2023\-Math\-En](https://huggingface.co/datasets/MARIO-Math-Reasoning/Gaokao2023-Math-En)Accessed: 2025\-09\-24Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- C\. Ming, C\. Qu, M\. Cai, Q\. Pei, Z\. Pan, Y\. Li, X\. Duan, L\. Wu, and C\. He \(2025\)IDEAL: data equilibrium adaptation for multi\-capability language model alignment\.arXiv preprint arXiv:2505\.12762\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p3.1),[§4](https://arxiv.org/html/2606.14971#S4.p1.1)\.
- Open\-R1 Team \(2024\)OpenR1\-Math\-220k: A Large\-Scale Dataset for Mathematical Reasoning\.Hugging Face\.Note:[https://huggingface\.co/datasets/open\-r1/OpenR1\-Math\-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k)Accessed: 2024\-06\-14Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- A\. Paszke, S\. Gross, F\. Massa, A\. Lerer, J\. Bradbury, G\. Chanan, T\. Killeen, Z\. Lin, N\. Gimelshein, L\. Antiga,et al\.\(2019\)Pytorch: an imperative style, high\-performance deep learning library\.Advances in neural information processing systems32\.Cited by:[§4\.3](https://arxiv.org/html/2606.14971#S4.SS3.p4.3)\.
- F\. Pedregosa \(2016\)Hyperparameter optimization with approximate gradient\.InInternational Conference on Machine Learning,Cited by:[§3\.3](https://arxiv.org/html/2606.14971#S3.SS3.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- D\. Rein, A\. Gudibande, J\. Petty, N\. Balepur, H\. Owhadi, E\. Jones, Y\. Li, S\. Brown, J\. Burnside, K\. Michael, J\. Albrecht, S\. R\. Bowman, B\. Christian, S\. Hammond, A\. Pilipiszyn, J\. Seares, J\. L\. Taylor, and W\. Saunders \(2023\)GPQA: a graduate\-level google\-proof qa benchmark\.arXiv preprint arXiv:2311\.12022\.Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- RyokoAI \(2023\)ShareGPT52K\.Hugging Face\.External Links:[Link](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)Cited by:[§4\.2](https://arxiv.org/html/2606.14971#S4.SS2.p1.1)\.
- M\. Sap, H\. Rashkin, D\. Chen, R\. LeBras, and Y\. Choi \(2019\)Socialiqa: commonsense reasoning about social interactions\.arXiv preprint arXiv:1904\.09728\.Cited by:[§4\.1](https://arxiv.org/html/2606.14971#S4.SS1.p1.1)\.
- M\. Shukor, L\. Bethune, D\. Busbridge, D\. Grangier, E\. Fini, A\. El\-Nouby, and P\. Ablin \(2025\)Scaling laws for optimal data mixtures\.arXiv preprint arXiv:2507\.09404\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p2.1)\.
- S\. Tong, E\. Brown, P\. Wu, S\. Woo, M\. Middepogu, S\. C\. Akula, J\. Yang, S\. Yang, A\. Iyer, X\. Pan, A\. Wang, R\. Fergus, Y\. LeCun, and S\. Xie \(2024\)Cambrian\-1: A Fully Open, Vision\-Centric Exploration of Multimodal LLMs\.arXiv preprint arXiv:2406\.16860\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p2.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar,et al\.\(2023\)Llama: open and efficient foundation language models\.arXiv preprint arXiv:2302\.13971\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- S\. M\. Xie, H\. Pham, X\. Dong, N\. Du, H\. Liu, Y\. Lu, P\. S\. Liang, Q\. V\. Le, T\. Ma, and A\. W\. Yu \(2024\)Doremi: optimizing data mixtures speeds up language model pretraining\.Advances in Neural Information Processing Systems36\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p2.1),[§4](https://arxiv.org/html/2606.14971#S4.p1.1)\.
- A\. Yang, B\. Xiao, B\. Wang, B\. Zhang, C\. Bian, C\. Yin, C\. Lv, D\. Pan, D\. Wang, D\. Yan,et al\.\(2023\)Baichuan 2: open large\-scale language models\.arXiv preprint arXiv:2309\.10305\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p2.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- A\. Yang, B\. Yang, B\. Hui, B\. Zheng, B\. Yu, C\. Zhou, C\. Li, C\. Li, D\. Liu, F\. Huang, G\. Dong, H\. Wei, H\. Lin, J\. Tang, J\. Wang, J\. Yang, J\. Tu, J\. Zhang, J\. Ma, J\. Yang, J\. Xu, J\. Zhou, J\. Bai, J\. He, J\. Lin, K\. Dang, K\. Lu, K\. Chen, K\. Yang, M\. Li, M\. Xue, N\. Ni, P\. Zhang, P\. Wang, R\. Peng, R\. Men, R\. Gao, R\. Lin, S\. Wang, S\. Bai, S\. Tan, T\. Zhu, T\. Li, T\. Liu, W\. Ge, X\. Deng, X\. Zhou, X\. Ren, X\. Zhang, X\. Wei, X\. Ren, X\. Liu, Y\. Fan, Y\. Yao, Y\. Zhang, Y\. Wan, Y\. Chu, Y\. Liu, Z\. Cui, Z\. Zhang, Z\. Guo, and Z\. Fan \(2024a\)Qwen2 technical report\.External Links:2407\.10671,[Link](https://arxiv.org/abs/2407.10671)Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024b\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§1](https://arxiv.org/html/2606.14971#S1.p1.1),[§2](https://arxiv.org/html/2606.14971#S2.p1.1)\.
- J\. Ye, P\. Liu, T\. Sun, J\. Zhan, Y\. Zhou, and X\. Qiu \(2024\)Data mixing laws: optimizing data mixtures by predicting language modeling performance\.arXiv preprint arXiv:2403\.16952\.Cited by:[§2](https://arxiv.org/html/2606.14971#S2.p2.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)Hellaswag: can a machine really finish your sentence?\.arXiv preprint arXiv:1905\.07830\.Cited by:[§4\.1](https://arxiv.org/html/2606.14971#S4.SS1.p1.1)\.