Counterfactual Residual Data Augmentation for Regression

arXiv cs.LG 06/30/26, 04:00 AM Papers

data-augmentation regression counterfactual tabular-data machine-learning noise-reduction

Summary

Proposes Counterfactual Residual Data Augmentation (CRDA) for tabular regression. CRDA exploits the invariance of residuals under small perturbations of selected features to generate realistic training samples, reducing MSE on benchmark datasets by an average of 22.9% for an MLP regressor and 6.4% for an XGBoost regressor.

arXiv:2606.28460v1. Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel Counterfactual Residual Data Augmentation (CRDA) technique for tabular regression. Our key insight is that once a regressor has modeled the systematic component of the data, the remaining noise can be viewed as an invariant residual that remains stable under small perturbations of carefully selected features. We exploit this residual invariance to generate new, yet realistic, training samples, effectively expanding the dataset without requiring additional real data. Our method is model-agnostic and readily applicable to various types of regressors. In experiments across datasets from a variety of benchmark repositories, on average, CRDA reduces an MLP Regressor's MSE by 22.9% and an XGBoost Regressor's MSE by 6.4%. When compared to existing state-of-the-art data generators and augmentation techniques, CRDA consistently outperforms in MSE reduction. By adding principled counterfactual variations to the training data, our method offers a simple and efficient remedy for noise-prone, small-sample regression settings.
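The residual-invariance idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how features are selected or how perturbations are drawn, so the Gaussian perturbation, the `scale` parameter, the hand-picked `feature_idx`, and the helper names `fit_linear` and `residual_augment` are all assumptions made here for illustration. A least-squares linear model stands in for the base regressor.

```python
import numpy as np

def fit_linear(X, y):
    """Fit ordinary least squares with an intercept; return a predict function.
    (Stand-in base regressor; any fitted model with a predict function would do.)"""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

    return predict

def residual_augment(X, y, predict, feature_idx, scale=0.05, n_copies=1, seed=0):
    """Hypothetical sketch: perturb selected features slightly and carry each
    sample's residual over to its perturbed copy."""
    rng = np.random.default_rng(seed)
    residuals = y - predict(X)                    # noise left after the systematic fit
    feature_std = X[:, feature_idx].std(axis=0)   # per-feature perturbation scale
    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_cf = X.copy()
        # Assumption: small Gaussian perturbations relative to each feature's spread.
        noise = rng.normal(0.0, scale * feature_std, size=(len(X), len(feature_idx)))
        X_cf[:, feature_idx] += noise
        # New target = model prediction at the perturbed point + unchanged residual.
        y_cf = predict(X_cf) + residuals
        X_parts.append(X_cf)
        y_parts.append(y_cf)
    return np.vstack(X_parts), np.concatenate(y_parts)

# Synthetic demo data (not from the paper).
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=50)

predict = fit_linear(X, y)
X_aug, y_aug = residual_augment(X, y, predict, feature_idx=[0, 1], n_copies=2)
print(X_aug.shape, y_aug.shape)  # (150, 3) (150,)
```

In this sketch the augmented set would then be used to retrain the regressor. Which features are safe to perturb, and by how much, is the part the abstract describes as "carefully selected" and is left open here.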


# Counterfactual Residual Data Augmentation for Regression
Source: [https://arxiv.org/abs/2606.28460](https://arxiv.org/abs/2606.28460)
[View PDF](https://arxiv.org/pdf/2606.28460)


## Submission history

From: Hossein Mohebbi \[[view email](https://arxiv.org/show-email/df812ac2/2606.28460)\] **\[v1\]** Fri, 26 Jun 2026 13:04:37 UTC (563 KB)


Similar Articles

Mind the Residual Gap: Probabilistic Downscaling under Real-World Bias

R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

REVES: REvision and VErification--Augmented Training for Test-Time Scaling

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting
