Counterfactual Residual Data Augmentation for Regression
Summary
Proposes Counterfactual Residual Data Augmentation (CRDA) for tabular regression, which exploits residual invariance under small perturbations of selected features to generate realistic training samples. Averaged across benchmark datasets, CRDA reduces MSE by 22.9% for an MLP regressor and by 6.4% for an XGBoost regressor.
# Counterfactual Residual Data Augmentation for Regression

Source: [https://arxiv.org/abs/2606.28460](https://arxiv.org/abs/2606.28460) [View PDF](https://arxiv.org/pdf/2606.28460)

> Abstract: Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. Inspired by the impact of data augmentation in vision and language, we propose a novel Counterfactual Residual Data Augmentation (CRDA) technique for tabular regression. Our key insight is that once a regressor has modeled the systematic component of the data, the remaining noise can be viewed as an invariant residual that remains stable under small perturbations of carefully selected features. We exploit this residual invariance to generate new, yet realistic, training samples, effectively expanding the dataset without requiring additional real data. Our method is model-agnostic and readily applicable to various types of regressors. In experiments across datasets from a variety of benchmark repositories, on average, CRDA reduces an MLP Regressor's MSE by 22.9% and an XGBoost Regressor's MSE by 6.4%. When compared to existing state-of-the-art data generators and augmentation techniques, CRDA consistently outperforms in MSE reduction. By adding principled counterfactual variations to the training data, our method offers a simple and efficient remedy for noise-prone, small-sample regression settings.

## Submission history

From: Hossein Mohebbi

**[v1]** Fri, 26 Jun 2026 13:04:37 UTC (563 KB)
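The abstract describes the residual-invariance idea only at a high level, so the following is a minimal sketch of one possible reading, not the authors' algorithm. It assumes a base regressor is fitted first, each training row's residual is kept fixed, a chosen subset of features is perturbed by small Gaussian noise, and the new target is the base prediction at the perturbed point plus the original residual. The function names (`fit_linear`, `predict_linear`, `residual_augment`), the `scale` and `n_copies` parameters, the Gaussian noise, and the least-squares linear base model are all illustrative assumptions. The abstract does not say how features are selected or how perturbations are drawn.

```python
import numpy as np


def fit_linear(X, y):
    """Fit an ordinary least-squares model with an intercept (stand-in base regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Predict with the coefficients returned by fit_linear."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def residual_augment(X, y, feature_idx, scale=0.05, n_copies=1, seed=0):
    """Return (X_aug, y_aug): the original rows followed by perturbed copies.

    Assumption: the residual of each row is treated as invariant when the
    features in `feature_idx` are nudged by noise of relative size `scale`.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    coef = fit_linear(X, y)
    residuals = y - predict_linear(coef, X)   # per-row noise estimate
    stds = X[:, feature_idx].std(axis=0)      # per-feature perturbation scale

    X_parts, y_parts = [X], [y]
    for _ in range(n_copies):
        X_new = X.copy()
        noise = rng.normal(0.0, scale, size=(len(X), len(feature_idx))) * stds
        X_new[:, feature_idx] += noise        # perturb only the selected features
        # Counterfactual target: systematic part at the new point + original residual.
        y_new = predict_linear(coef, X_new) + residuals
        X_parts.append(X_new)
        y_parts.append(y_new)

    return np.vstack(X_parts), np.concatenate(y_parts)
```

Under these assumptions, a downstream regressor (for example an MLP or a gradient-boosted model) would then be trained on `X_aug, y_aug` instead of the original data. In practice the base model would be whichever regressor is being augmented, since the abstract describes the method as model-agnostic.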
Similar Articles
Mind the Residual Gap: Probabilistic Downscaling under Real-World Bias
The paper introduces ReMatch, a method that aligns training residual distributions to test-time regimes via optimal transport in PCA space to mitigate bias in probabilistic downscaling, achieving better calibration and dispersion.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
Proposes R2R2, a regularization method for self-predictive learning in reinforcement learning to mitigate overfitting under high update-to-data ratios, achieving significant improvements on continuous control tasks.
From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data
Introduces Multi-Agent Residual In-Context Learning (MARICL), an agentic framework that uses LLM agents to analyze residuals from a base model on tabular data, hypothesize missing structure, and produce explicit correction terms via textual gradient optimization. Across nine benchmarks, MARICL consistently improves over its base model and demonstrates mechanistic generalization in cell-free protein predictions.
REVES: REvision and VErification--Augmented Training for Test-Time Scaling
Proposes REVES, a two-stage iterative framework that alternates between data augmentation and policy optimization to improve LLM reasoning by leveraging intermediate correction steps, achieving superior performance on coding benchmarks and constraint satisfaction problems.
RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting
RAFT is a two-stage framework for domain-specific fine-tuning of LLMs that addresses catastrophic forgetting by refining supervision data and using on-policy distillation with adaptive loss balancing, achieving significant improvements on domain accuracy while recovering general capabilities.