CausalMix: Data Mixture as Causal Inference for Language Model Training
Summary
CausalMix formulates data mixture optimization as a causal inference problem for LLM training, enabling dynamic adaptation to shifting data distributions without costly retraining, and demonstrates improved performance on Qwen2.5-0.5B and Qwen3-4B-Base.
Cached at: 07/02/26, 03:46 AM
Source: https://huggingface.co/papers/2607.01104
Abstract
CausalMix addresses limitations in LLM data mixing by formulating mixture optimization as a causal inference problem, enabling dynamic adaptation to shifting data distributions without costly retraining.
In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.
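To make the covariate/treatment framing concrete, here is a minimal sketch in Python. It is not the paper's implementation: the abstract does not specify the causal model, so this example swaps in a simple S-learner (one ridge-regressed outcome model over covariates, treatment, and their interactions). The helper names (`featurize`, `fit_outcome_model`, `cate`, `best_mixture`), the single pool covariate, the three domains, and the synthetic loss function are all hypothetical. The simulated mixtures are drawn at random, so this toy has no confounding to correct for.

```python
import numpy as np

def featurize(x, t):
    """S-learner features: covariates x, treatment t, and x-by-t interactions."""
    x, t = np.atleast_2d(x), np.atleast_2d(t)
    inter = (x[:, :, None] * t[:, None, :]).reshape(len(x), -1)
    return np.hstack([np.ones((len(x), 1)), x, x**2, t, t**2, inter])

def fit_outcome_model(x, t, y, ridge=1e-3):
    """Ridge regression of the outcome (loss) on the S-learner features."""
    phi = featurize(x, t)
    gram = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(gram, phi.T @ y)

def predict(w, x, t):
    return featurize(x, t) @ w

def cate(w, x, t, t_base):
    """Estimated change in loss from using mixture t instead of t_base, given x."""
    return predict(w, x, t) - predict(w, x, t_base)

def best_mixture(w, x_new, n_domains, rng, n_candidates=20000):
    """Random search over the simplex for the lowest predicted loss at x_new."""
    cands = rng.dirichlet(np.ones(n_domains), size=n_candidates)
    xs = np.repeat(np.atleast_2d(x_new), n_candidates, axis=0)
    return cands[np.argmin(predict(w, xs, cands))]

# Hypothetical ground truth: the ideal mixture drifts with the pool covariate.
def target_mixture(x):
    return np.hstack([0.2 + 0.4 * x, 0.5 - 0.3 * x, 0.3 - 0.1 * x])

rng = np.random.default_rng(0)
n_runs, n_domains = 512, 3                     # 512 simulated proxy runs
x = rng.uniform(0.0, 1.0, size=(n_runs, 1))    # pool statistic (covariate)
t = rng.dirichlet(np.ones(n_domains), size=n_runs)  # domain mixture (treatment)
y = ((t - target_mixture(x)) ** 2).sum(axis=1) + rng.normal(0.0, 0.01, n_runs)

w = fit_outcome_model(x, t, y)

# A shifted pool: reuse the fitted model instead of rerunning the proxy sweep.
x_new = np.array([0.9])
best = best_mixture(w, x_new, n_domains, rng)
target_new = target_mixture(np.atleast_2d(x_new))[0]
uniform = np.full(n_domains, 1.0 / n_domains)
effect = cate(w, x_new, best, uniform)[0]

print("inferred mixture:", np.round(best, 3))
print("synthetic optimum:", np.round(target_new, 3))
print("estimated effect vs. uniform mixture:", round(float(effect), 4))
```

The point of the sketch is the last step: when the pool covariate changes, the fitted outcome model is queried again for a new mixture, with no new proxy runs. With real, non-randomized training runs, a method designed to handle confounding would be needed in place of the plain regression used here.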
Similar Articles
Data Mixing for Large Language Models Pretraining: A Survey and Outlook
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.
Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time
This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.
FastMix: Fast Data Mixture Optimization via Gradient Descent
FastMix is a novel framework that automates data mixture discovery for training large models using a single proxy model and bilevel optimization, achieving state-of-the-art performance with significant efficiency gains.
RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories
RegMix-D extends RegMix to dynamic data mixing by using loss trajectories from proxy runs to predict optimal mixtures at multiple training stages, achieving improvements over static methods.
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.