CausalMix: Data Mixture as Causal Inference for Language Model Training

Hugging Face Daily Papers

Summary

CausalMix formulates data mixture optimization as a causal inference problem for LLM training, enabling dynamic adaptation to shifting data distributions without costly retraining, and demonstrates improved performance on Qwen2.5-0.5B and Qwen3-4B-Base.

In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.
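For readers less familiar with the terminology, the CATE mentioned above has a standard definition in the potential-outcomes framework. The notation here is an assumption made for illustration, since the abstract does not give the paper's own symbols: $x$ stands for the statistical features of the data pool (the covariates), $w$ for a candidate domain mixture over $K$ domains (the treatment), $w_0$ for a reference mixture, and $Y(w)$ for the downstream score that training with mixture $w$ would yield.

```latex
\tau(x, w) = \mathbb{E}\bigl[\, Y(w) - Y(w_0) \mid X = x \,\bigr],
\qquad
w^{*}(x) = \arg\max_{w \in \Delta^{K-1}} \tau(x, w)
```

Here $\Delta^{K-1}$ is the probability simplex over the $K$ domains. Under this reading, a "state-dependent optimal mixture" is $w^{*}(x)$: when the pool statistics $x$ change, the recommended mixture is re-evaluated from the fitted effect model rather than re-estimated from new training runs.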
Cached at: 07/02/26, 03:46 AM

Source: https://huggingface.co/papers/2607.01104

Abstract

CausalMix addresses limitations in LLM data mixing by formulating mixture optimization as a causal inference problem, enabling dynamic adaptation to shifting data distributions without costly retraining.

In Large Language Model (LLM) training, data mixing plays a pivotal role in determining model performance. Recent methods optimize mixture weights via proxy models, but they rely on the assumption of static data distributions. As a result, when the underlying data pool shifts, these methods require costly retraining from scratch. This limitation restricts their ability to scale seamlessly from small settings to larger data pools and model sizes. In this paper, we propose CausalMix to address this limitation by casting data mixture optimization as a causal inference problem. We formulate the statistical features of the data pool as covariates and the domain mixture as the treatment. After fitting a causal model on 512 runs of Qwen2.5-0.5B to estimate the Conditional Average Treatment Effect (CATE), we extrapolate the optimal mixture for an 800K data pool and apply it to train a 7B model. Furthermore, we successfully generalize the framework to long chain-of-thought data on Qwen3-4B-Base. By leveraging causal modeling to isolate confounding biases, CausalMix dynamically infers state-dependent optimal data mixtures. Extensive experiments show that the mixture guided by CausalMix consistently improves performance across multiple downstream tasks, outperforming RegMix and other baselines. In addition, we use the CATE Interpreter to provide visual analysis of the learned mixing strategy. Overall, CausalMix offers a causal and interpretable framework for optimizing LLM data mixtures.
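The covariate/treatment framing described in the abstract can be illustrated with a small toy sketch. This is not the paper's estimator or data: it assumes a single hypothetical covariate (`pool_quality`), a one-dimensional treatment (the fraction of "code" data in a two-domain mixture), a synthetic `simulate_run` function standing in for real proxy training runs, and randomly assigned mixtures so that simple per-cell averages give an unconfounded effect estimate. The names `CANDIDATE_MIXTURES`, `BASELINE`, `fit_cate`, and `best_mixture` are likewise invented for illustration.

```python
import random
from collections import defaultdict

# Hypothetical candidate treatments: fraction of "code" data (rest is "web" text).
CANDIDATE_MIXTURES = [0.1, 0.3, 0.5, 0.7, 0.9]
BASELINE = 0.5  # reference mixture that effects are measured against


def simulate_run(pool_quality, code_fraction, rng):
    """Toy stand-in for one proxy training run; returns a synthetic score.

    The best code fraction depends on the pool covariate, so no single
    static mixture is optimal for every pool state.
    """
    best_fraction = 0.2 + 0.6 * pool_quality
    return 1.0 - (code_fraction - best_fraction) ** 2 + rng.gauss(0.0, 0.01)


def covariate_bin(pool_quality, n_bins=4):
    """Map a covariate in [0, 1] to one of n_bins equal-width strata."""
    return min(int(pool_quality * n_bins), n_bins - 1)


def fit_cate(runs, n_bins=4):
    """Estimate CATE(bin, w) = E[Y | bin, w] - E[Y | bin, BASELINE] by cell means.

    Assumes every (bin, mixture) cell has at least one run.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for quality, w, score in runs:
        key = (covariate_bin(quality, n_bins), w)
        sums[key] += score
        counts[key] += 1
    means = {key: sums[key] / counts[key] for key in sums}
    return {(b, w): means[(b, w)] - means[(b, BASELINE)] for (b, w) in means}


def best_mixture(cate, pool_quality, n_bins=4):
    """Pick the candidate mixture with the largest estimated effect for this pool."""
    b = covariate_bin(pool_quality, n_bins)
    return max(CANDIDATE_MIXTURES, key=lambda w: cate[(b, w)])


rng = random.Random(0)
runs = []
for _ in range(512):  # same run count as in the abstract; the data is synthetic
    quality = rng.random()
    w = rng.choice(CANDIDATE_MIXTURES)  # randomized treatment assignment
    runs.append((quality, w, simulate_run(quality, w, rng)))

cate = fit_cate(runs)
low = best_mixture(cate, 0.1)   # recommended code fraction for a low-quality pool
high = best_mixture(cate, 0.9)  # recommended code fraction for a high-quality pool
print("low-quality pool:", low, "| high-quality pool:", high)
```

In this synthetic setup the recommended code fraction rises with `pool_quality`, which is the sense in which the mixture is state-dependent: a new pool only needs its covariates computed, not a new batch of training runs. A stratified average is the simplest possible choice and does not extrapolate beyond observed covariate ranges; a real system would more plausibly use a parametric or tree-based CATE estimator.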

Similar Articles

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

arXiv cs.CL

This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

arXiv cs.CL

This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Hugging Face Daily Papers

CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.