Counterfactual Residual Data Augmentation for Regression

arXiv cs.LG Papers

Summary

Proposes Counterfactual Residual Data Augmentation (CRDA) for tabular regression, which exploits the invariance of residuals under small perturbations of selected features to generate realistic training samples. Across benchmark datasets, CRDA reduces MSE by 22.9% on average for an MLP regressor and by 6.4% for an XGBoost regressor.

arXiv:2606.28460v1

Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel Counterfactual Residual Data Augmentation (CRDA) technique for tabular regression. Our key insight is that once a regressor has modeled the systematic component of the data, the remaining noise can be viewed as an invariant residual that remains stable under small perturbations of carefully selected features. We exploit this residual invariance to generate new, yet realistic, training samples, effectively expanding the dataset without requiring additional real data. Our method is model-agnostic and readily applicable to various types of regressors. In experiments across datasets from a variety of benchmark repositories, on average, CRDA reduces an MLP Regressor's MSE by 22.9% and an XGBoost Regressor's MSE by 6.4%. When compared to existing state-of-the-art data generators and augmentation techniques, CRDA consistently outperforms in MSE reduction. By adding principled counterfactual variations to the training data, our method offers a simple and efficient remedy for noise-prone, small-sample regression settings.

# Counterfactual Residual Data Augmentation for Regression
Source: [https://arxiv.org/abs/2606.28460](https://arxiv.org/abs/2606.28460)
[View PDF](https://arxiv.org/pdf/2606.28460)

> Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel Counterfactual Residual Data Augmentation (CRDA) technique for tabular regression. Our key insight is that once a regressor has modeled the systematic component of the data, the remaining noise can be viewed as an invariant residual that remains stable under small perturbations of carefully selected features. We exploit this residual invariance to generate new, yet realistic, training samples, effectively expanding the dataset without requiring additional real data. Our method is model-agnostic and readily applicable to various types of regressors. In experiments across datasets from a variety of benchmark repositories, on average, CRDA reduces an MLP Regressor's MSE by 22.9% and an XGBoost Regressor's MSE by 6.4%. When compared to existing state-of-the-art data generators and augmentation techniques, CRDA consistently outperforms in MSE reduction. By adding principled counterfactual variations to the training data, our method offers a simple and efficient remedy for noise-prone, small-sample regression settings.
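The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of residual invariance: fit a base regressor, compute each sample's residual, perturb a few features slightly, and label the perturbed sample with the base prediction at the new point plus the original residual. Everything specific here is an assumption for illustration and not the authors' method: the helper names `fit_linear` and `residual_augment`, the ordinary-least-squares base model, the hand-picked `feature_idx`, and the Gaussian perturbation scaled by `scale` times each feature's standard deviation.

```python
import numpy as np


def fit_linear(X, y):
    """Hypothetical base regressor: ordinary least squares with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

    return predict


def residual_augment(X, y, predict, feature_idx, scale=0.05, n_copies=1, seed=0):
    """Return the original data plus n_copies perturbed copies.

    Assumption: a perturbed sample keeps the residual of the sample it
    was derived from, so its target is predict(x') + (y - predict(x)).
    """
    rng = np.random.default_rng(seed)
    residuals = y - predict(X)  # noise left after the systematic fit
    feature_std = X[:, feature_idx].std(axis=0)

    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_new = X.copy()
        noise = rng.normal(0.0, 1.0, size=(len(X), len(feature_idx)))
        # Small perturbation of the selected features only.
        X_new[:, feature_idx] += scale * feature_std * noise
        # Systematic part at the new point, plus the unchanged residual.
        y_new = predict(X_new) + residuals
        X_parts.append(X_new)
        y_parts.append(y_new)
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy usage on synthetic data (not from the paper).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - X[:, 1] + data_rng.normal(scale=0.1, size=50)

predict = fit_linear(X, y)
X_aug, y_aug = residual_augment(X, y, predict, feature_idx=[0, 1], n_copies=2)
```

Because the sketch only needs a `predict` callable, the base model could in principle be swapped for any regressor, which matches the abstract's model-agnostic claim; the final model would then be retrained on `X_aug` and `y_aug`.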

## Submission history

From: Hossein Mohebbi

**[v1]** Fri, 26 Jun 2026 13:04:37 UTC (563 KB)
