CausalMix: Data Mixture as Causal Inference for Language Model Training

Hugging Face Daily Papers 07/01/26, 12:00 AM Papers

causal-inference data-mixing language-model llm-training optimization qwen

Summary

CausalMix formulates data mixture optimization as a causal inference problem for LLM training, enabling dynamic adaptation to shifting data distributions without costly retraining, and demonstrates improved performance on Qwen2.5-0.5B and Qwen3-4B-Base.

In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.
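For reference, the Conditional Average Treatment Effect named above has a standard textbook form, written here for a mixture-valued treatment. The notation is an assumption for illustration and is not taken from the paper: X denotes the data-pool covariates, w a domain mixture on the simplex over K domains, w_0 a reference mixture, and Y(w) the downstream performance that would result from training with mixture w (higher is better). The abstract does not state the paper's exact formulation, so this is only the generic definition and the selection rule it suggests.

```latex
\tau(x;\, w, w_0) \;=\; \mathbb{E}\big[\, Y(w) - Y(w_0) \,\big|\, X = x \,\big],
\qquad
w^{*}(x) \;=\; \arg\max_{w \in \Delta^{K-1}} \; \tau(x;\, w, w_0)
```

Because the effect is conditioned on x, the preferred mixture w*(x) can change when the data pool's statistics change, which is the sense in which the inferred mixture is state-dependent.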

Original Article

Cached at: 07/02/26, 03:46 AM

Paper page - CausalMix: Data Mixture as Causal Inference for Language Model Training

Source: https://huggingface.co/papers/2607.01104

Abstract

CausalMix addresses limitations in LLM data mixing by formulating mixture optimization as a causal inference problem, enabling dynamic adaptation to shifting data distributions without costly retraining.

In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.
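The covariate/treatment framing in the abstract can be illustrated with a small synthetic sketch. Everything here is an assumption for illustration: the three domains, the single pool feature `x`, the `target_mixture` ground truth, and the quadratic outcome model are all hypothetical, and the estimator is plain regression adjustment (an S-learner style model fit by least squares), not the causal model used in the paper. The outcome is a loss, so lower is better. Only the run count of 512 echoes the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of data domains (hypothetical)

def target_mixture(x):
    # Hypothetical ground truth: the best mixture drifts with the pool feature x in [0, 1].
    return np.stack([0.2 + 0.5 * x, 0.5 - 0.3 * x, 0.3 - 0.2 * x], axis=-1)

def design(x, w):
    # Outcome-model features: intercept, x, x^2, free weights, squared weights,
    # and covariate-by-weight interactions. The last weight is dropped from the
    # linear and interaction terms because the weights sum to one.
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    w = np.asarray(w, dtype=float).reshape(len(x), K)
    return np.hstack([np.ones_like(x), x, x**2, w[:, :-1], w**2, x * w[:, :-1]])

# Simulate proxy runs: covariate x describes the pool, mixture w is the treatment.
# Mixtures are drawn independently of x, so this toy data has no confounding;
# logged real-world runs would need an estimator that adjusts for it.
n_runs = 512
x_obs = rng.uniform(0.0, 1.0, size=n_runs)
w_obs = rng.dirichlet(np.ones(K), size=n_runs)
loss = ((w_obs - target_mixture(x_obs)) ** 2).sum(axis=1)
loss += 0.01 * rng.standard_normal(n_runs)

# Fit the outcome model by ordinary least squares.
beta, *_ = np.linalg.lstsq(design(x_obs, w_obs), loss, rcond=None)

def cate(x, w, w_base):
    # Estimated effect on the loss of using mixture(s) w instead of w_base,
    # conditional on pool feature x.
    w = np.atleast_2d(w)
    x_rep = np.full(len(w), x)
    base = np.tile(w_base, (len(w), 1))
    return design(x_rep, w) @ beta - design(x_rep, base) @ beta

def best_mixture(x, n_candidates=5000):
    # Score random candidate mixtures against a uniform baseline and keep the
    # one with the most negative estimated effect on the loss.
    candidates = rng.dirichlet(np.ones(K), size=n_candidates)
    w_base = np.full(K, 1.0 / K)
    effects = cate(x, candidates, w_base)
    return candidates[np.argmin(effects)]

best_low = best_mixture(0.1)   # pool with a low feature value
best_high = best_mixture(0.8)  # shifted pool with a high feature value
```

In this toy setting the selected mixture moves with `x` without refitting the model, which mirrors the abstract's point that conditioning on pool statistics avoids retraining when the pool shifts. A random candidate search is used only to keep the sketch short; any constrained optimizer over the simplex would serve the same role.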

Get this paper in your agent:

hf papers read 2607.01104

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

FastMix: Fast Data Mixture Optimization via Gradient Descent

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
