DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

arXiv cs.LG 06/11/26, 04:00 AM Papers
debugging training-data data-quality influence-functions machine-learning data-cleaning
Summary
DeMix is a novel framework that detects erroneous training samples and identifies their specific error types (label errors, feature errors, spurious correlations) by analyzing influence vectors, achieving a 22.61% improvement in debugging F1-score and 9.32% gain in task performance after data repair.
arXiv:2606.11616v1 Announce Type: new Abstract: High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: https://github.com/SJTU-DMTai/DeMix.
# DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors
Source: [https://arxiv.org/html/2606.11616](https://arxiv.org/html/2606.11616)

###### Abstract

High\-quality training data is essential for the success of machine learning models\. However, real\-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations\. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement\. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types\. Our key insight is that different error types produce distinct patterns on model behavior\. DeMix captures such error\-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples\. We formulate training data debugging as a multi\-label classification problem where a classifier is developed to predict error types directly from influence vectors\. We further introduce an intervention\-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively\. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state\-of\-the\-art approaches, achieving a 22\.61% improvement in data debugging F1\-score and a 9\.32% gain in task model performance after data repair\. Code is available at:[https://github\.com/SJTU\-DMTai/DeMix](https://github.com/SJTU-DMTai/DeMix)\.

Data Debugging; Data Attribution; Data Errors; Influence Function;

Published in: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’26), August 09–13, 2026, Jeju Island, Republic of Korea. DOI: 10.1145/3770855.3817774. ISBN: 979-8-4007-2259-2/2026/08. Copyright: CC. CCS concepts: Computing methodologies, Neural networks; Information systems, Data cleaning.

## 1. Introduction

Data serves as the fundamental resource powering diverse machine learning applications, from recommendation systems (Yin et al., [2024](https://arxiv.org/html/2606.11616#bib.bib40); Zhang et al., [2025b](https://arxiv.org/html/2606.11616#bib.bib43); Kersbergen et al., [2025](https://arxiv.org/html/2606.11616#bib.bib17)) to large language model-based applications (Xia et al., [2024](https://arxiv.org/html/2606.11616#bib.bib38); Peng et al., [\[n. d.\]](https://arxiv.org/html/2606.11616#bib.bib28); Deng et al., [2025](https://arxiv.org/html/2606.11616#bib.bib8), [\[n. d.\]](https://arxiv.org/html/2606.11616#bib.bib9); Weng et al., [2026](https://arxiv.org/html/2606.11616#bib.bib35); Zhao et al., [2026](https://arxiv.org/html/2606.11616#bib.bib44)). Training data quality has emerged as the primary determinant of model performance, establishing a new data-centric paradigm in machine learning deployment (Liang et al., [2022](https://arxiv.org/html/2606.11616#bib.bib23)). However, preparing data for model training involves a multi-stage process that typically encompasses data collection, transformation, feature engineering, and labeling (Liang et al., [2022](https://arxiv.org/html/2606.11616#bib.bib23)). Each stage may contain systematic flaws that introduce different types of errors into the final training dataset.
Common error types include mislabeled samples caused by ambiguous annotation guidelines (Myrtakis et al., [2025](https://arxiv.org/html/2606.11616#bib.bib26); Deng et al., [2024](https://arxiv.org/html/2606.11616#bib.bib10); Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19)), corrupted features resulting from bugs in feature processing systems (Myrtakis et al., [2025](https://arxiv.org/html/2606.11616#bib.bib26); Ding et al., [2025](https://arxiv.org/html/2606.11616#bib.bib11)), and spurious correlations arising from selection bias or confounding variables (Ye et al., [2025](https://arxiv.org/html/2606.11616#bib.bib39); Wu et al., [2023](https://arxiv.org/html/2606.11616#bib.bib36); Chen et al., [2024a](https://arxiv.org/html/2606.11616#bib.bib4); Gao et al., [2026](https://arxiv.org/html/2606.11616#bib.bib12)). When models are trained on data containing such mixed error types, they inevitably learn erroneous and biased patterns, resulting in unreliable predictions and significant deployment risks. Therefore, *training data debugging* has become a crucial problem, which requires addressing two interconnected questions: *which training samples are erroneous*, and *what type of error do they contain*. Answering both questions jointly provides essential implications for locating systematic flaws and fixing them at their source.

Various efforts have been devoted to improving training data quality, which can be generally classified into two categories. Data cleaning methods (Chu et al., [2016](https://arxiv.org/html/2606.11616#bib.bib7); Siddiqi et al., [2023](https://arxiv.org/html/2606.11616#bib.bib30); Ding et al., [2025](https://arxiv.org/html/2606.11616#bib.bib11); Bao et al., [2024](https://arxiv.org/html/2606.11616#bib.bib3)) primarily assume erroneous samples exhibit statistical deviations from clean data and can thus be flagged via distributional analysis, outlier detection, or consistency checks. While effective for random errors and isolated anomalies, these methods are limited in identifying systematic errors in training data. For instance, if a specific subgroup is consistently mislabeled due to incorrect labeling functions, none of its samples appear as anomalies relative to others in the same group, making data cleaning methods ineffective. Recently, data attribution methods (Deng et al., [2025](https://arxiv.org/html/2606.11616#bib.bib8); Hammoudeh and Lowd, [2024](https://arxiv.org/html/2606.11616#bib.bib13); Myrtakis et al., [2025](https://arxiv.org/html/2606.11616#bib.bib26); Kersbergen et al., [2025](https://arxiv.org/html/2606.11616#bib.bib17); Zhang et al., [2025b](https://arxiv.org/html/2606.11616#bib.bib43)) have emerged as a promising alternative. They use influence functions (Koh and Liang, [2017](https://arxiv.org/html/2606.11616#bib.bib18); Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19); Hammoudeh and Lowd, [2024](https://arxiv.org/html/2606.11616#bib.bib13)) to quantify how removing a training sample affects model performance on a validation set, flagging those that negatively impact the model as erroneous. However, they mainly focus on identifying erroneous samples while leaving error type classification unsolved.
Additionally, accurate data attribution relies on the availability of clean and unbiased validation sets, which are difficult to obtain in many practical applications such as recommendation systems, where validation data often suffers from the same systematic errors as the training data. For these reasons, existing methods are inadequate for comprehensive training data debugging.

To address the problem, our key insight is that systematic training data errors often introduce consistent biases into model behavior, and these biases are reflected in how training samples influence model predictions across the entire validation set. Critically, different error types induce qualitatively different patterns on the validation data through influence functions. For example, samples with label errors tend to exert negative influence on validation samples with similar features but correct labels. In contrast, samples with spurious correlations often positively influence validation samples sharing spurious attributes while negatively influencing counterexamples that violate the spurious pattern. Hence, we capture such error-specific patterns by using the complete influence vector rather than aggregating influence into a single scalar. Formally, for a training sample $z_i$, the *influence vector* is defined as $\Phi_i=[\phi_{i,1},\dots,\phi_{i,M}]$, where $M$ is the validation set size. Each entry $\phi_{i,j}$ measures the impact of removing $z_i$ w.r.t. the loss of validation sample $z_j$, computed via influence functions. Fig. [1](https://arxiv.org/html/2606.11616#S1.F1) provides empirical evidence of the distinguishing power of influence vectors: for a training sample $\{z_i\}$, we compare the t-SNE embeddings of its influence vector $\Phi_i$ and the raw features $\{z_i=(x_i,y_i)\}$. Across the three error types, namely label errors (LE), feature errors (FE), and spurious correlations (SC), the visualization shows that influence vectors successfully disentangle error-specific clusters that remain mixed in the original data space. We observe consistent clustering results on the other datasets, as detailed in Appendix [A](https://arxiv.org/html/2606.11616#A1).

Based on our insight, we develop a multi-label classifier that takes the influence vector $\Phi_i$ of each training sample $z_i\in\mathcal{D}_t$ as input and predicts a set of error types denoted as $\hat{\mathrm{t}}_i$. Since influence vectors encode interactions with an unordered validation set, we employ a Set Transformer (Lee et al., [2019](https://arxiv.org/html/2606.11616#bib.bib22)) to encode $\Phi_i$ into a low-dimensional representation, which is then decoded by multiple MLP heads for the final prediction. To facilitate supervised training of the classifier, we provide a controlled error injection strategy that generates synthetic datasets where selected training samples are deliberately corrupted and annotated with known error types. Note that we do not assume that the validation set is perfectly clean or unbiased. Since we focus on identifying characteristic patterns rather than relying on absolute values in the influence vectors, our method can tolerate noisy validation data as long as the patterns induced by different error types remain distinguishable.

However, a key challenge in the proposed solution remains: influence vectors depend not only on the training data itself, but also on the configuration used for influence computation. Both the choice of the validation set and the task model instance affect the resulting vectors, even when the underlying training data remains unchanged. Without proper controls, the classifier may exploit configuration-specific patterns that fail to generalize beyond the training setup. To mitigate this issue, we introduce an invariant representation learning strategy that encourages the Set Transformer encoder to extract error-specific patterns stable across different influence computation settings. From an Information Bottleneck perspective (Tishby et al., [2000](https://arxiv.org/html/2606.11616#bib.bib32); Alemi et al., [2017](https://arxiv.org/html/2606.11616#bib.bib2); Miao et al., [2022](https://arxiv.org/html/2606.11616#bib.bib25)), this strategy learns a minimal sufficient representation that preserves error-type semantics while filtering configuration-specific noise (detailed analysis in Appendix [C](https://arxiv.org/html/2606.11616#A3)). This is achieved through two kinds of interventions. First, we intervene on the validation set by computing multiple influence vectors with different randomly sampled validation subsets, and apply a contrastive loss to keep their representations close. Second, we intervene on the task model by computing influence vectors with an ensemble of models that differ in architecture or initialization, and align their representations via a pairwise consistency loss. Together, these losses force the encoder to focus on patterns that persist across configurations, improving generalization in training data debugging.

In this paper, we present DeMix, an automated framework designed for training data Debugging with Mixed error types. The framework operates by first computing influence vectors for training samples and then feeding them into the classifier to predict erroneous samples along with their error types. We evaluate DeMix on 11 tasks spanning tabular data prediction, recommendation systems, and LLM alignment. The results demonstrate that DeMix significantly outperforms state-of-the-art baselines, achieving a 22.61% improvement in error type classification F1-score and a 9.32% gain in the task model performance with repaired training data.

![Refer to caption](https://arxiv.org/html/2606.11616v1/x1.png) Figure 1. t-SNE visualization of (a) influence vectors and (b) raw features of erroneous samples, with three types of errors injected into the Adult dataset (more results in Appendix [A](https://arxiv.org/html/2606.11616#A1)).

The main contributions of this paper are summarized as follows.

- We investigate the under-explored problem of debugging training data with mixed error types, which requires identifying both erroneous samples and their corresponding error types. We reveal that influence vectors effectively capture the distinct patterns through which different error types affect model predictions.
- We propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types by formulating the problem as multi-label classification with influence vectors as inputs. We further introduce intervention-based training objectives that guide the classifier to capture invariant and error-specific patterns.
- Extensive experiments across diverse machine learning tasks and task models demonstrate that DeMix outperforms state-of-the-art baselines by accurately debugging data with mixed error types and improving task model performance through targeted repair tailored to different error types.

## 2. Preliminaries

We consider a standard supervised ML setting involving a task model $f_{\theta}:\mathcal{X}\to\mathcal{Y}$ that maps an input $x_i$ from the input space $\mathcal{X}$ to a label $y_i$ in the label space $\mathcal{Y}$, where $\theta$ denotes the model parameters. Given a training dataset $\mathcal{D}_t=\{z_i=(x_i,y_i)\}_{i=1}^{N}$ consisting of $N$ samples, the objective is to learn the optimal parameters $\theta^{*}$ that minimize a predefined loss function $\ell$, e.g., cross-entropy or mean squared error. Formally, we have:

$$\theta^{*}=\arg\min_{\theta}\frac{1}{N}\sum_{i=1}^{N}\ell(f_{\theta}(x_i),y_i). \tag{1}$$

Data Error Types. In this paper, we consider the following three error types that are prevalent in real-world training data.

(1) Label Error (LE). Let $y_i^{*}$ denote the ground-truth label for training sample $z_i$. We define the set of samples with label errors as $\mathcal{D}_t^{\mathrm{LE}}=\{z_i=(x_i,y_i)\mid y_i\neq y_i^{*}\}$. Label errors are ubiquitous and typically arise from imperfect annotation processes, including human subjectivity in crowdsourced tasks and faults in automated labeling functions (Liang et al., [2022](https://arxiv.org/html/2606.11616#bib.bib23)).

(2) Feature Error (FE). Let $x_i^{*}$ denote the ground-truth feature of sample $z_i$. The set of samples with feature errors is denoted as $\mathcal{D}_t^{\mathrm{FE}}=\{z_i=(x_i,y_i)\mid x_i\neq x_i^{*}\}$. Feature errors manifest in various forms, such as missing values, outliers, and attribute dependency violations. These anomalies typically arise from sporadic or systematic failures during data collection (Ding et al., [2025](https://arxiv.org/html/2606.11616#bib.bib11)).

(3) Spurious Correlation (SC). We first define a group annotation $g:=(y,a)$ composed of a label $y\in\mathcal{Y}$ and a non-causal spurious attribute $a\in\mathcal{A}$ (Ye et al., [2025](https://arxiv.org/html/2606.11616#bib.bib39)). The training data can be partitioned into groups based on $g$. When certain groups have significantly smaller sample sizes, we identify them as underrepresented minority groups, denoted as $\mathcal{G}_{\mathrm{min}}$. We focus on samples within these underrepresented groups, defined as $\mathcal{D}_t^{\mathrm{SC}}=\{z_i=(x_i,y_i)\mid(y_i,a_i)\in\mathcal{G}_{\mathrm{min}}\}$. Such an imbalance typically arises from sampling bias, causing models to learn spurious attributes as non-causal shortcuts (Ye et al., [2025](https://arxiv.org/html/2606.11616#bib.bib39)).

###### Example 0\.

Fig. [2](https://arxiv.org/html/2606.11616#footnote2) illustrates three data error types in the Adult dataset for the income prediction task. (1) LEs stem from incorrect annotation: for instance, $z_1$ with the ground-truth label “$\leq$50K” is mislabeled as “$>$50K”; (2) $z_3$ exhibits an FE by violating the attribute dependency between “Sex” and “Relationship”; (3) Regarding SCs, with “Country” as a spurious attribute, the majority groups (“Country=A, Income$>$50K” and “Country=B, Income$\leq$50K”) establish a non-causal spurious correlation between “Country” and “Income”. ML models learned through ERM typically exploit this shortcut, resulting in prediction failures on underrepresented samples like $z_9$ (with “Country=B, Income$>$50K”).

Data with Mixed Error Types. We consider training data $\mathcal{D}_t$ contaminated by a mixture of error types $\mathcal{T}=\{\mathrm{LE},\mathrm{FE},\mathrm{SC}\}$. Formally, the set of erroneous samples is defined as the union of individual error sets: $\mathcal{D}^{\mathrm{Err}}_t=\mathcal{D}^{\mathrm{LE}}_t\cup\mathcal{D}^{\mathrm{FE}}_t\cup\mathcal{D}^{\mathrm{SC}}_t$. This setting explicitly allows for error co-occurrence, meaning any single sample may simultaneously contain multiple types of errors.

###### Definition 0 \(Training data debugging\)\.

Let $\mathcal{T}=\{\mathrm{LE},\mathrm{FE},\mathrm{SC}\}$ be the set of potential error types. *Training data debugging* aims to learn a mapping $g:\mathcal{X}\times\mathcal{Y}\to\{0,1\}^{3}$. For each sample $z_i$, $g$ assigns a binary error vector $\hat{\mathrm{t}}_i=[\hat{t}_i^{(1)},\dots,\hat{t}_i^{(3)}]$, where $\hat{t}_i^{(k)}=1$ indicates the presence of the $k$-th error type in $\mathcal{T}$. Clearly, the mapping function indicates both erroneous samples (where $\hat{\mathrm{t}}_i\neq\mathbf{0}$) and their error types.
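To make the target of this definition concrete, the following minimal Python sketch encodes the ground-truth error vector of a sample from the three error sets. The function names, the ordering of error types, and the toy sample indices are assumptions chosen for illustration and do not come from the DeMix code base.

```python
# Illustrative only: names and toy data are assumptions, not the authors' code.
ERROR_TYPES = ("LE", "FE", "SC")  # assumed fixed ordering of the error-type set T


def encode_error_vector(sample_id, error_sets):
    """Return the binary vector [t^(1), t^(2), t^(3)] for one training sample."""
    return [int(sample_id in error_sets[name]) for name in ERROR_TYPES]


def is_erroneous(error_vector):
    """A sample is erroneous if and only if its error vector is not all zeros."""
    return any(error_vector)


# Toy error sets keyed by sample index; sample 4 carries two error types at once.
error_sets = {"LE": {1, 4}, "FE": {3, 4}, "SC": {9}}
vectors = {i: encode_error_vector(i, error_sets) for i in (1, 2, 3, 4, 9)}
```

Because each entry is decided independently, a sample that belongs to several error sets simply receives several ones, which is what allows error co-occurrence.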

Influence Function. To quantify the impact of a training sample $z_i$ on a validation sample $z_j$, the Leave-One-Out (LOO) score offers a straightforward measure of influence by computing $\text{LOO}(z_i,z_j):=\ell(z_j;\hat{\theta}_{-i})-\ell(z_j;\hat{\theta})$. However, computing exact LOO scores is computationally prohibitive as it requires retraining the model for each sample. Consequently, Influence Functions (Koh and Liang, [2017](https://arxiv.org/html/2606.11616#bib.bib18); Hammoudeh and Lowd, [2024](https://arxiv.org/html/2606.11616#bib.bib13); Hu et al., [2025](https://arxiv.org/html/2606.11616#bib.bib15)) are widely adopted as a feasible approximation. By estimating the validation loss change under an infinitesimal up-weighting of $z_i$, the influence score is derived as:

$$\phi(z_i,z_j):=-\nabla_{\hat{\theta}}\ell(z_j;\hat{\theta})^{\top}H_{\hat{\theta}}^{-1}\nabla_{\hat{\theta}}\ell(z_i;\hat{\theta}), \tag{2}$$

where $H_{\hat{\theta}}$ is the empirical Hessian. Intuitively, this approximates the parameter update via a Newton step and estimates the loss change using a first-order Taylor expansion (Choe et al., [2024](https://arxiv.org/html/2606.11616#bib.bib6)). Nevertheless, computing the inverse Hessian remains expensive, particularly for models with massive parameter spaces, e.g., LLMs. To ensure scalability, a common practice is to restrict the influence computation to the parameters of the final classification layer of the LLM (Deng et al., [2025](https://arxiv.org/html/2606.11616#bib.bib8); Xia et al., [2024](https://arxiv.org/html/2606.11616#bib.bib38); Hu et al., [2025](https://arxiv.org/html/2606.11616#bib.bib15)).
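As a concrete illustration of Eq. (2), the following is a minimal NumPy sketch for logistic regression, where the Hessian is small enough to solve against directly. The model choice, the L2 damping term `l2` (added so the Hessian is invertible), the synthetic data, and all function names are assumptions made for illustration; they are not taken from the DeMix implementation. The sign follows Eq. (2) as written, i.e., the first-order change in the loss on $z_j$ when $z_i$ is up-weighted.

```python
import numpy as np


def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))


def hessian(theta, X, l2):
    """Empirical Hessian of the mean cross-entropy loss plus an L2 damping term."""
    p = sigmoid(X @ theta)
    return (X.T * (p * (1.0 - p))) @ X / len(X) + l2 * np.eye(len(theta))


def fit(X, y, l2=1e-2, steps=25):
    """Fit L2-regularized logistic regression with a few Newton steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y) + l2 * theta
        theta -= np.linalg.solve(hessian(theta, X, l2), grad)
    return theta


def grad_loss(theta, x, y):
    """Gradient of the per-sample cross-entropy loss w.r.t. theta."""
    return (sigmoid(x @ theta) - y) * x


def influence(theta, X, z_i, z_j, l2=1e-2):
    """Eq. (2): phi(z_i, z_j) = -grad l(z_j)^T H^{-1} grad l(z_i)."""
    h_inv_g = np.linalg.solve(hessian(theta, X, l2), grad_loss(theta, *z_i))
    return -grad_loss(theta, *z_j) @ h_inv_g


# Synthetic, noisy binary classification data (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.5, -1.0, 0.5]) + rng.normal(size=200) > 0).astype(float)
theta = fit(X, y)

x = X[0]
same_label = influence(theta, X, (x, y[0]), (x, y[0]))            # z_i agrees with z_j
flipped_label = influence(theta, X, (x, 1.0 - y[0]), (x, y[0]))   # z_i has the opposite label
```

In this toy setting `same_label` is negative and `flipped_label` is positive: up-weighting a sample that agrees with $z_j$ lowers the loss on $z_j$, while up-weighting a copy with the flipped label raises it. To first order, the effect of removing $z_i$ has the opposite sign.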

![Refer to caption](https://arxiv.org/html/2606.11616v1/x2.png) Figure 2. An example of three error types (Adult dataset). Footnote 2: Note that SC is a dataset-level error, so we flag minority-group samples (e.g., $z_9$) for the recognition of SC and group-aware reweighting during repair.

## 3. Methodology

In this section, we present an automated training data debugging framework named DeMix\. We begin by analyzing why influence vectors can effectively distinguish training samples with different error types\. Building on this insight, we develop a parameterized classifier with a specialized training paradigm\. We then describe how these components integrate into a complete debugging pipeline\.

### 3.1. Influence Vector

Different data error types affect model training through distinct mechanisms. Specifically, label errors corrupt supervision signals and distort decision boundaries; feature errors perturb input representations and the feature space geometry; spurious correlations encourage models to rely on non-causal shortcuts. Our key observation is that these mechanistic differences produce distinct behavior patterns on the validation set through influence functions. A training sample with a specific error type tends to influence particular subsets of validation samples in a characteristic way. For example, it may predominantly affect samples with similar features, samples near decision boundaries, or samples within the same spurious subgroup. These structured patterns provide direct signals for distinguishing error types.

To capture such patterns, we represent each training sample by its influence vector, which records its influence on every validation sample. Formally, for a training sample $z_i$, we define its influence vector $\Phi_i\in\mathbb{R}^{M}$ as the concatenation of its influence scores over the validation set $\mathcal{D}_v$:

$$\Phi_i:=[\phi_{i,1},\dots,\phi_{i,M}], \tag{3}$$

where $M=|\mathcal{D}_v|$ and each entry $\phi_{i,j}=\phi(z_i,z_j)$ is the influence score of $z_i$ w.r.t. a validation sample $z_j$.

Prior data attribution methods (Koh and Liang, [2017](https://arxiv.org/html/2606.11616#bib.bib18); Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19); Myrtakis et al., [2025](https://arxiv.org/html/2606.11616#bib.bib26); Xia et al., [2024](https://arxiv.org/html/2606.11616#bib.bib38)) typically summarize influence as a single scalar value, collapsing structured patterns and obscuring error-specific signals. In contrast, the complete influence vector preserves both the magnitude and distribution of influence, making it effective for identifying different error types.
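The information lost by scalar aggregation can be seen in a small hand-made example. The two influence vectors below are hypothetical values chosen for illustration (they are not results from the paper): they have the same sum, yet one is concentrated on a few validation samples and the other is spread evenly.

```python
# Hypothetical influence vectors over M = 8 validation samples (illustrative values).
phi_a = [-1.0, -0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0]  # concentrated on three samples
phi_b = [-0.25] * 8                                   # spread evenly over all samples


def scalar_score(phi):
    """Aggregate an influence vector into a single number by summation."""
    return sum(phi)


def support_size(phi, tol=1e-9):
    """Count the validation samples that are noticeably influenced."""
    return sum(abs(v) > tol for v in phi)


scores = (scalar_score(phi_a), scalar_score(phi_b))
supports = (support_size(phi_a), support_size(phi_b))
```

Both vectors receive the same scalar score, so a scalar-based method cannot tell them apart, whereas a classifier operating on the full vectors can still exploit the difference in how the influence is distributed.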

###### Example 0\.

We illustrate the effectiveness of influence vectors using the examples in Fig. [2](https://arxiv.org/html/2606.11616#footnote2). (1) LE: Since $z_1$ violates the feature-label mapping, the model tends to memorize this anomaly. Consequently, it misguides predictions on validation samples $z_j$ that share similar features but hold the correct label. Removing $z_1$ would significantly reduce the loss on $z_j$, manifesting as a strongly negative influence value, whereas its impact on dissimilar samples remains negligible. (2) FE: Since $z_3$ violates the dependency between “Sex” and “Relationship”, the model treats it as an outlier isolated from class centers. Such outliers exert a minor effect on the decision boundary. Consequently, in the influence vector, $z_3$ exhibits fluctuating influence on validation samples very close to the boundary, while having a negligible impact on samples well within the class center. (3) SC: Due to spurious correlations, the model may exploit “Country” as a shortcut to predict “Income”. In terms of influence, the underrepresented sample $z_9$ exerts a significant impact on validation samples within the same group, whereas its impact on samples from other groups is negligible. We further present a case study on the Adult dataset in Appendix [B](https://arxiv.org/html/2606.11616#A2), qualitatively showing that influence vectors exhibit distinct patterns for different error types.

### 3.2. Data Debugging via Influence Vectors

We reformulate training data debugging as a multi-label classification task with influence vectors as inputs. Formally, given the influence vector $\Phi_i\in\mathbb{R}^{M}$ and the set of error types $\mathcal{T}=\{\mathrm{LE},\mathrm{FE},\mathrm{SC}\}$, we aim to learn a mapping function $g_{\psi}:\mathbb{R}^{M}\to\{0,1\}^{3}$. For each sample $z_i\in\mathcal{D}_t$, $g_{\psi}$ outputs a binary vector $\hat{\mathrm{t}}_i\in\{0,1\}^{3}$, with each entry indicating the presence of the corresponding error type in $\mathcal{T}$. This formulation enables the detection of multiple error types within a single sample $z_i$, i.e., $\sum_{k=1}^{3}\hat{t}_i^{(k)}\geq 2$.

We then employ a Data Error Classifier (DEC) for this task. The architectural design of DEC is driven by two requirements: (1) Permutation Invariance: the influence vector essentially represents an unordered set of scalar values, where the ordering of validation samples holds no semantic information; and (2) Element-wise Interactions: the elements within an influence vector are not independent. For instance, a training sample has similar influence values w.r.t. similar validation samples. Capturing such element-wise interactions is crucial for distinguishing specific error types. Hence, we employ the Set Transformer (Lee et al., [2019](https://arxiv.org/html/2606.11616#bib.bib22)) as the encoder of DEC. Specifically, DEC maps the input set $\Phi_i$ to a low-dimensional representation $\mathrm{h}_i$ using a unified architecture that stacks a Set Attention Block (SAB) followed by Pooling by Multihead Attention (PMA) with $l$ learnable seed vectors $S$:

$$\mathrm{h}_i=\text{PMA}_{l}(\text{SAB}(\Phi_i))=\text{MAB}(S,\text{rFF}(\text{MAB}(\Phi_i,\Phi_i))). \tag{4}$$

The MLP-based heads parameterized by $\omega$ then map $\mathrm{h}_i$ into predictions: $\hat{\mathrm{t}}_i=\text{MLP}_{\omega}(\mathrm{h}_i)$.

The core building block, the Multihead Attention Block (MAB), processes two input sets $X$ and $Y$ through:

$$\text{MAB}(X,Y)=\text{LayerNorm}(H+\text{rFF}(H)), \tag{5}$$

where $H=\text{LayerNorm}(X+\text{Multihead}(X,Y,Y))$ and rFF is a row-wise feedforward layer. Functionally, SAB preserves permutation equivariance to capture element-wise interactions, while PMA enforces permutation invariance by aggregating these equivariant features via attention against the fixed seed queries $S$.
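The structure of Eqs. (4) and (5) can be sketched as a NumPy forward pass. This is a simplified illustration under several assumptions that are not stated in the paper: a single attention head stands in for Multihead, each scalar influence value is lifted to a $d$-dimensional vector by an assumed linear embedding, rFF is a single ReLU layer, the weights are random rather than trained, and the MLP heads are omitted. All class and parameter names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class MAB:
    """Eq. (5), with one attention head as a stand-in for Multihead."""

    def __init__(self, d, rng):
        self.wq, self.wk, self.wv, self.wf = (
            rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)
        )

    def __call__(self, x, y):
        q, k, v = x @ self.wq, y @ self.wk, y @ self.wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
        h = layer_norm(x + attn)                              # H in Eq. (5)
        return layer_norm(h + np.maximum(h @ self.wf, 0.0))   # H + rFF(H)


class Encoder:
    """h_i = PMA(SAB(Phi_i)) as in Eq. (4), untrained and for illustration only."""

    def __init__(self, d=8, num_seeds=1, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(1, d))           # assumed scalar-to-vector embedding
        self.seeds = rng.normal(size=(num_seeds, d))   # seed vectors S
        self.w_rff = rng.normal(scale=d ** -0.5, size=(d, d))
        self.sab, self.pma = MAB(d, rng), MAB(d, rng)

    def __call__(self, phi):
        x = np.asarray(phi, dtype=float)[:, None] @ self.embed  # (M, d) set elements
        x = self.sab(x, x)                                      # SAB(X) = MAB(X, X)
        z = np.maximum(x @ self.w_rff, 0.0)                     # rFF inside PMA
        return self.pma(self.seeds, z).ravel()                  # MAB(S, rFF(.))


enc = Encoder()
phi = np.array([-1.0, 0.2, 0.0, 0.7, -0.3])
perm = np.array([3, 0, 4, 1, 2])
h_original, h_permuted = enc(phi), enc(phi[perm])
```

Reordering the validation samples permutes the rows produced by the SAB step, and the attention pooling against the seed vectors sums over those rows, so `h_original` and `h_permuted` agree up to floating-point error.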

![Refer to caption](https://arxiv.org/html/2606.11616v1/x3.png) Figure 3. Interventions on the validation set and task model.

### 3.3. Training Paradigm

We formalize the training data for DEC as paired instances $\mathcal{E}=\{e_i=(\Phi_i,\mathrm{t}_i)\}_{i=1}^{N}$. To construct $\mathcal{E}$, we employ a controlled error injection strategy. Specifically, given an ML task with training set $\mathcal{D}_t$, validation set $\mathcal{D}_v$, and a task model $f$, we introduce random mixed-type errors into a clean data subset via the following protocols: (1) LEs are generated by perturbing raw annotations, such as flipping labels for classification or adding noise for regression. (2) FEs are introduced by compromising feature integrity or violating attribute dependencies, e.g., injecting random noise or disrupting logical constraints. (3) SCs are synthesized by manipulating subgroup ratios to induce strong correlations between non-causal attributes and target labels. Crucially, this process explicitly supports error overlap, enabling the generation of samples exhibiting compound error types. Furthermore, by varying injection ratios and random seeds, we can explicitly control the error intensity and synthesize highly diverse training datasets for DEC tailored to specific tasks.
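A minimal sketch of such an injection routine is given below, assuming binary labels and numeric features. It covers only label flipping and feature noise; SC injection, which requires resampling subgroups, is left out and its tag stays at zero. The function name, its parameters, and the toy data are hypothetical and are not taken from the DeMix code base.

```python
import random


def inject_errors(samples, le_rate, fe_rate, seed, noise_scale=1.0):
    """Corrupt a copy of (features, binary_label) pairs.

    Returns (corrupted, tags) with tags[i] = [LE, FE, SC]; SC is always 0 here
    because subgroup resampling is not part of this sketch.
    """
    rng = random.Random(seed)  # the seed controls which samples are corrupted
    corrupted, tags = [], []
    for features, label in samples:
        features = list(features)           # copy so the clean data stays untouched
        has_le = rng.random() < le_rate
        has_fe = rng.random() < fe_rate     # drawn independently, so LE and FE can overlap
        if has_le:
            label = 1 - label               # LE: flip the binary label
        if has_fe:
            k = rng.randrange(len(features))
            features[k] += rng.gauss(0.0, noise_scale)  # FE: perturb one feature
        corrupted.append((features, label))
        tags.append([int(has_le), int(has_fe), 0])
    return corrupted, tags


# Toy clean subset (hypothetical): 100 samples with two features each.
clean = [([0.0, 1.0], 0), ([1.0, 0.0], 1)] * 50
noisy, tags = inject_errors(clean, le_rate=0.3, fe_rate=0.3, seed=0)
```

Changing `le_rate`, `fe_rate`, or `seed` yields a different corrupted dataset with known tags, which is the property the error injection strategy relies on to generate diverse supervised data for DEC.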

To optimize DEC for data debugging accuracy, we minimize the standard element-wise Binary Cross-Entropy (BCE) loss. Given model predictions $\hat{\mathrm{t}}_i$ and labels $\mathrm{t}_i$, the prediction loss is defined as:

\(6\)ℒpred=−∑k=13\[ti\(k\)log⁡\(t^i\(k\)\)\+\(1−ti\(k\)\)log⁡\(1−t^i\(k\)\)\]\.\\mathcal\{L\}\_\{\\rm pred\}=\-\\sum\_\{k=1\}^\{3\}\\left\[\{\\rm t\}\_\{i\}^\{\(k\)\}\\log\(\\hat\{\\rm t\}\_\{i\}^\{\(k\)\}\)\+\(1\-\{\\rm t\}\_\{i\}^\{\(k\)\}\)\\log\(1\-\\hat\{\\rm t\}\_\{i\}^\{\(k\)\}\)\\right\]\.
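As a small worked instance of Equation (6), the sketch below evaluates the three-way BCE for one sample. The probabilities are made-up illustrative values, and the clipping constant `eps` is an added numerical safeguard that is not part of the equation.

```python
import numpy as np

def bce_loss(t_hat, t, eps=1e-12):
    # Element-wise binary cross-entropy summed over the three error types.
    t_hat = np.clip(t_hat, eps, 1 - eps)
    return -np.sum(t * np.log(t_hat) + (1 - t) * np.log(1 - t_hat))

t = np.array([1.0, 0.0, 1.0])       # sample has an LE and an SC, no FE
t_hat = np.array([0.9, 0.2, 0.6])   # hypothetical DEC outputs
loss = bce_loss(t_hat, t)           # equals -log(0.9 * 0.8 * 0.6)
```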
However, relying solely on Equation ([6](https://arxiv.org/html/2606.11616#S3.E6)) presents a critical challenge. The calculation of influence vectors is inherently context-dependent: it relies not only on the training sample itself but also on the configuration, including the choice of validation set $\mathcal{D}_v$ and the task model $f$. Consequently, standard DEC training risks encoding these configuration-specific artifacts into the learned representations alongside the causal patterns related to error types. This causes the DEC to overfit to specific task configurations, thereby impeding generalization. To address this, drawing inspiration from invariant learning (Wu et al., [2022](https://arxiv.org/html/2606.11616#bib.bib37)), we propose an intervention-based training strategy. As illustrated in Figure [3](https://arxiv.org/html/2606.11616#S3.F3), we systematically perturb the validation set selection and model architecture for the same training data, effectively simulating diverse environments. By introducing specific regularization terms, we compel the Set Transformer to learn invariant representations that are robust to different configurations.

(1) Interventions on Validation Set. Since influence calculation relies on the validation set, using a fixed set inevitably introduces specific biases unrelated to error types. To mitigate this, we intervene by randomly sampling multiple validation subsets. Specifically, for a training sample $e_i$, we generate augmented views $\{e_i^{(1)},\dots,e_i^{(V)}\}$ derived from $V$ distinct validation subsets. Since the true error type is invariant to the choice of validation set, we enforce representation invariance using an InfoNCE-inspired contrastive loss (Oord et al., [2018](https://arxiv.org/html/2606.11616#bib.bib27)). Let ${\rm h}_i$ be the anchor embedding and $\{{\rm h}_i^{(1)},\dots,{\rm h}_i^{(V)}\}$ be its positives obtained from the different validation subsets. The loss is defined as:

$$\mathcal{L}_{V}=-\mathbb{E}_{{\rm h}_i,{\rm h}_i^{(v)}}\left[\log\frac{e^{\text{sim}({\rm h}_i,{\rm h}_i^{(v)})/\tau}}{e^{\text{sim}({\rm h}_i,{\rm h}_i^{(v)})/\tau}+\sum_{j\in\mathcal{B},{\rm t}_j\neq{\rm t}_i}e^{\text{sim}({\rm h}_i,{\rm h}_j)/\tau}}\right],\tag{7}$$

where $\text{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is the temperature hyperparameter, and $\mathcal{B}$ is the current batch. Minimizing $\mathcal{L}_V$ effectively pulls the representations of positive pairs closer while pushing negative pairs apart. This optimization compels DEC to distill a representation ${\rm h}_i$ that is invariant to the validation set and solely dependent on patterns related to data error labels.
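The contrastive term can be sketched directly from Equation (7). The version below is a plain NumPy illustration, assuming that negatives are the anchor embeddings of batch samples whose multi-hot error labels differ from the anchor's; the function name, array layout, and default temperature are hypothetical choices.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def validation_invariance_loss(anchors, views, labels, tau=0.1):
    """anchors: (B, d); views: (B, V, d); labels: (B, 3) multi-hot error types."""
    losses = []
    B = len(anchors)
    for i in range(B):
        # Negatives: batch samples with a different error-type label vector.
        neg = [j for j in range(B) if not np.array_equal(labels[j], labels[i])]
        neg_sum = sum(np.exp(cosine(anchors[i], anchors[j]) / tau) for j in neg)
        for v in range(views.shape[1]):
            pos = np.exp(cosine(anchors[i], views[i, v]) / tau)
            losses.append(-np.log(pos / (pos + neg_sum)))
    return float(np.mean(losses))

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([[1, 0, 0], [0, 1, 0]])
aligned = anchors[:, None, :].repeat(2, axis=1)   # views identical to their anchors
swapped = aligned[::-1]                           # views point at the other sample
loss_aligned = validation_invariance_loss(anchors, aligned, labels)
loss_swapped = validation_invariance_loss(anchors, swapped, labels)
```

In this toy batch the loss is close to zero when each view matches its anchor, and rises to $\log 2$ when the views are orthogonal to their anchor, which is the behavior the text describes as pulling positives together.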

(2) Interventions on Task Model. Similarly, the task model architecture constitutes another configuration for influence vectors. Different architectures produce influence vectors with vastly different statistical distributions (e.g., scale and sparsity), creating distribution shifts that hinder transferability. To ensure DEC learns representations that remain invariant as the task model architecture changes, we simulate diverse environments by constructing an ensemble of $K$ heterogeneous models, $\{f_1,\dots,f_K\}$. Consequently, for each training sample $z_i$, we compute a set of $K$ influence vectors $\{\Phi_i^{(1)},\dots,\Phi_i^{(K)}\}$, corresponding to the $K$ distinct task models. To explicitly align the representation spaces across these diverse architectures, we employ an alignment loss. Let ${\rm h}_i^{(k)}$ denote the representation derived from the influence vector of the $i$-th sample under the $k$-th model $f_k$. Since all $K$ representations correspond to the same error type, they should be mapped to an identical point in the representation space, regardless of the task model. We therefore minimize the pairwise Euclidean distance between representations of the same sample across all task models. The alignment loss is defined as:

$$\mathcal{L}_{M}=\frac{1}{K(K-1)}\sum_{1\leq a\neq b\leq K}\left\|{\rm h}_i^{(a)}-{\rm h}_i^{(b)}\right\|^{2}_{2}.\tag{8}$$

Minimizing $\mathcal{L}_M$ penalizes the discrepancy between representations from different task models, compelling the DEC to extract consistent, model-invariant features that are generalizable to task model variations.
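Equation (8) averages squared distances over all ordered pairs of task models. A minimal NumPy sketch for a single training sample follows; the function name and the toy representations are illustrative, and the sketch assumes $K\geq 2$ since no pairs exist otherwise.

```python
import numpy as np

def alignment_loss(h):
    """h: (K, d) representations of one training sample under K task models."""
    K = h.shape[0]
    diff = h[:, None, :] - h[None, :, :]   # (K, K, d) pairwise differences
    sq = np.sum(diff ** 2, axis=-1)        # squared distances; diagonal is zero
    return sq.sum() / (K * (K - 1))        # mean over ordered pairs a != b

h = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
loss = alignment_loss(h)                   # (2 * (1 + 1 + 2)) / 6 = 4/3
loss_identical = alignment_loss(np.ones((3, 2)))
```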

The overall objective function of DEC integrates the primary prediction loss with our proposed regularization terms for invariant learning\. Formally, the overall loss is defined as:

$$\mathcal{L}_{\rm DEC}=\mathcal{L}_{\rm pred}+\lambda_{1}\mathcal{L}_{V}+\lambda_{2}\mathcal{L}_{M},\tag{9}$$

where $\lambda_1$ and $\lambda_2$ are trade-off hyperparameters that balance the contribution of validation set invariance and task model invariance. We detail the hyperparameter settings in the implementation details of Section [4.1](https://arxiv.org/html/2606.11616#S4.SS1).

Algorithm 1 Training Data Debugging Workflow in DeMix

Input: Training set $\mathcal{D}_t=\{z_i\}_{i=1}^{N}$, validation set $\mathcal{D}_v=\{z_j\}_{j=1}^{M}$, DEC $g_\psi$, repair tools $\mathcal{R}=\{R_t\}_{t\in\mathcal{T}}$

Output: Corrected training set $\mathcal{D}_t^{\rm r}$

1: Train task model $\hat{\theta}$ on $\mathcal{D}_t$

2: Initialize $\mathcal{D}_t^{\rm r}\leftarrow\emptyset$

3: for each training sample $z_i\in\mathcal{D}_t$ do

4: Compute influence vector $\Phi_i$ on $\mathcal{D}_v$ and $\hat{\theta}$ with Equation ([3](https://arxiv.org/html/2606.11616#S3.E3))

5: Predict data error types via DEC: $\hat{\rm t}_i\leftarrow g_\psi(\Phi_i)$

6: if $\hat{\rm t}_i=(0,0,0)$ then

7: $\mathcal{D}_t^{\rm r}\leftarrow\mathcal{D}_t^{\rm r}\cup\{z_i\}$ ▷ Keep clean samples

8: else

9: $\hat{z}_i\leftarrow z_i$

10: for each $t\in\mathcal{T}$ where $\hat{\rm t}_i$ is activated for $t$ do

11: $\hat{z}_i\leftarrow R_t(\hat{z}_i)$ ▷ Apply type-specific repair

12: end for

13: $\mathcal{D}_t^{\rm r}\leftarrow\mathcal{D}_t^{\rm r}\cup\{\hat{z}_i\}$

14: end if

15: end for

16: return $\mathcal{D}_t^{\rm r}$
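The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a schematic under stated assumptions, not the released implementation: `influence_fn`, `dec`, and the entries of `repair_tools` are hypothetical callables supplied by the caller, task model training (step 1) is assumed to have happened inside `influence_fn`, and the toy sample format is invented for the example.

```python
ERROR_TYPES = ("LE", "FE", "SC")

def debug_training_set(train, influence_fn, dec, repair_tools):
    """Steps 2-16 of Algorithm 1: classify each sample, then repair per error type."""
    repaired = []
    for z in train:
        phi = influence_fn(z)            # step 4: influence vector of z
        t_hat = dec(phi)                 # step 5: multi-hot error-type prediction
        if not any(t_hat):
            repaired.append(z)           # step 7: keep clean samples unchanged
            continue
        z_hat = z
        for name, active in zip(ERROR_TYPES, t_hat):
            if active:
                z_hat = repair_tools[name](z_hat)   # step 11: type-specific repair
        repaired.append(z_hat)
    return repaired

# Toy usage: a sample is (feature, label); a negative label is treated as an LE.
train = [(1.0, 2.0), (3.0, -4.0)]
influence_fn = lambda z: z
dec = lambda phi: (1, 0, 0) if phi[1] < 0 else (0, 0, 0)
repair_tools = {
    "LE": lambda z: (z[0], abs(z[1])),
    "FE": lambda z: z,
    "SC": lambda z: z,
}
repaired = debug_training_set(train, influence_fn, dec, repair_tools)
```

Because repairs are applied in sequence to the same `z_hat`, a sample flagged with several error types passes through each corresponding repair tool.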

### 3\.4\.Overall Debugging Workflow

The overall debugging workflow of DeMix is outlined in Algorithm[1](https://arxiv.org/html/2606.11616#alg1)\. First, DeMix computes influence vectors for training samples\. The influence vectors are then fed into a trained DEC, which identifies erroneous samples and diagnoses their specific error types\. Subsequently, DeMix executes type\-specific repair on the detected samples; detailed implementation of these repair tools is provided in the implementation details of Section[4\.1](https://arxiv.org/html/2606.11616#S4.SS1)\.

Complexity Analysis. Given $N$ training samples, $M$ validation samples, and a task model with $P$ parameters, the time complexity of data debugging comprises three sequential stages: task model training, influence vector computation, and error type classification. First, training the task model for $T$ epochs incurs a complexity of $\mathcal{O}(N\cdot P\cdot T)$. Second, influence vector computation involves estimating the Hessian-vector product for each validation sample via a stochastic algorithm (Koh and Liang, [2017](https://arxiv.org/html/2606.11616#bib.bib18)) and multiplying it by the training sample gradients, leading to $\mathcal{O}(N\cdot M\cdot P)$. Finally, the error classifier inference requires approximately $\mathcal{O}(N\cdot M^{2})$, dominated by the self-attention mechanism. Consequently, the total time complexity is approximately $\mathcal{O}(N(PT+PM+M^{2}))$. Regarding space complexity, our method requires $\mathcal{O}(P)$ for storing model parameters and $\mathcal{O}(N\cdot M)$ for the influence vectors.
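As a rough worked illustration of which term dominates, consider hypothetical sizes that are not taken from the paper: $N=10^{5}$, $M=10^{3}$, $P=10^{6}$, and $T=10$. Ignoring constant factors, the three per-sample terms are

$$PT=10^{7},\qquad PM=10^{9},\qquad M^{2}=10^{6},$$

so that

$$N\,(PT+PM+M^{2})=10^{5}\times 1.011\times 10^{9}\approx 1.01\times 10^{14}.$$

Under these assumed sizes the influence computation term $N\cdot M\cdot P$ accounts for almost all of the cost, which is consistent with reducing $M$ by validation sampling being an effective lever.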

## 4\.Experiments

In this section, we evaluate DeMix through comprehensive experiments, aiming to answer the following research questions:

RQ1 (Debugging performance): How accurately can DeMix distinguish erroneous data along with their error types, particularly under different ratios of data errors?

RQ2 (Repair efficacy): Can DeMix be effectively utilized to repair data and improve task model performance?

RQ3 (Ablation studies): What is the specific contribution of each component within DeMix to its overall effectiveness?

### 4\.1\.Experimental Setup

Datasets and Models. To the best of our knowledge, no existing real-world dataset simultaneously contains mixed error types with known error-type labels. We therefore adopt controlled error injection into clean base data for systematic evaluation of data debugging methods. To demonstrate the versatility of the evaluation, we select clean base datasets spanning three diverse domains: (1) Tabular data prediction: We select six standard benchmark datasets from the UCI repository (https://archive.ics.uci.edu/dataset/), covering diverse task types: binary classification (Adult, Bank, and Credit), multi-class classification (Covertype), and regression (Bike Sharing and Air Quality). For task model architectures, we employ three models: a 2-layer MLP1, a 4-layer MLP2, and a state-of-the-art Transformer-based model, TabPFN (Hollmann et al., [\[n. d.\]](https://arxiv.org/html/2606.11616#bib.bib14)). (2) Recommendation systems: We utilize three widely adopted datasets: Amazon (https://nijianmo.github.io/amazon/index.html), MovieLens (https://grouplens.org/datasets/movielens/), and Yelp (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset). For task model architectures, we employ three deep recommendation models: DIN (Zhou et al., [2018](https://arxiv.org/html/2606.11616#bib.bib47)), DIEN (Zhou et al., [2019](https://arxiv.org/html/2606.11616#bib.bib46)), and DHEN (Zhang et al., [2022](https://arxiv.org/html/2606.11616#bib.bib41)). (3) LLM Alignment: To evaluate the scalability of DeMix to large-scale generative models, we extend our evaluation to the LLM preference alignment task. We employ the UltraFeedback (https://huggingface.co/datasets/trl-lib/ultrafeedback_binarized) and Capybara (https://huggingface.co/datasets/trl-lib/Capybara-Preferences) preference datasets to fine-tune a Qwen-0.5B-Instruct (Team, [2024](https://arxiv.org/html/2606.11616#bib.bib31)) model via Direct Preference Optimization (DPO) (Rafailov et al., [2023](https://arxiv.org/html/2606.11616#bib.bib29)). The alignment quality of the task model is subsequently evaluated on the AlpacaEval benchmark (https://github.com/tatsu-lab/alpaca_eval).

Beyond synthetically injected errors, to verify the practical validity of data debugging methods on datasets with naturally occurring errors, we introduce two publicly available datasets: CIFAR-10N (https://github.com/UCSC-REAL/cifar-10-100n) provides human-annotated LEs for CIFAR-10 images, and CelebA (https://www.kaggle.com/datasets/jessicali9530/celeba-dataset) is a face attribute prediction dataset with a natural SC between gender and hair color.

Controlled Error Injection. To verify the effectiveness of a data debugging method, we inject data errors into the datasets and expect the method to accurately identify and repair these injected errors. We first split all datasets into training, validation, and test sets with a ratio of $(0.6, 0.2, 0.2)$, and then inject random mixed-type errors into the training and validation sets. We retain a proportion $\alpha$ of raw data as clean samples. For the remaining $(1-\alpha)$, we introduce various error types and allow error overlap, so that a single sample may simultaneously harbor multiple error types. Injection of specific error types is conducted as follows:

(1) Label Errors (LEs): For tabular data classification tasks, we flip the ground-truth label to a different class; for tabular data regression tasks, we add random noise to the original labels. For recommendation tasks, where labels are typically binary (indicating user-item interactions such as clicks or reviews), we flip the interaction status (e.g., $1\to 0$). For LLM alignment tasks, which rely on preference pairs (“chosen” vs. “rejected”), we inject errors by swapping the “chosen” and “rejected” responses along with their corresponding reward scores.

\(2\) Feature Errors \(FEs\): For tabular data prediction and recommendation systems, where data typically consists of various feature columns, we randomly select several feature columns and inject different types of feature errors, including outliers, attribute dependency violations, categorical feature flipping, and format inconsistencies\(Ding et al\.,[2025](https://arxiv.org/html/2606.11616#bib.bib11)\)\. Additionally, for sequential features in recommendation \(e\.g\., user interaction history\), we apply random shuffling, deletion, or replacement\. For LLM alignment, we introduce feature errors by randomly replacing or deleting tokens in the response text\.

(3) Spurious Correlations (SCs): To induce reliance on non-causal shortcuts, we first construct a set of spurious attributes $\mathcal{A}$ following (Tong et al., [\[n. d.\]](https://arxiv.org/html/2606.11616#bib.bib33)). We randomly select one spurious attribute $a\in\mathcal{A}$ and ensure that the proportions of majority and minority groups with respect to $a$ are approximately 90% and 10%, respectively. The choice of attributes varies by domain: for tabular and recommendation data, we leverage non-causal attributes such as gender or race (Tong et al., [\[n. d.\]](https://arxiv.org/html/2606.11616#bib.bib33)); for LLM preference data, we use attributes that are not directly related to response quality, such as response length, format structure, and emoji (Chen et al., [2024b](https://arxiv.org/html/2606.11616#bib.bib5); Zhang et al., [2025a](https://arxiv.org/html/2606.11616#bib.bib42)).
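One way to realize the 90%/10% group ratio is by subsampling. The sketch below is only an illustration of that idea and may differ from the authors' procedure: it assumes a binary attribute and a binary label, defines the majority group as samples where the attribute agrees with the label, and uses the hypothetical helper name `induce_spurious_correlation`.

```python
import numpy as np

def induce_spurious_correlation(a, y, majority=0.9, seed=0):
    """Return indices of a subsample in which a == y for about `majority` of rows."""
    rng = np.random.default_rng(seed)
    agree = np.flatnonzero(a == y)        # candidate majority group
    disagree = np.flatnonzero(a != y)     # candidate minority group
    # Keep all agreeing rows and just enough disagreeing rows to hit the ratio.
    n_min = min(len(disagree), int(len(agree) * (1 - majority) / majority))
    keep_min = rng.choice(disagree, size=n_min, replace=False)
    idx = np.concatenate([agree, keep_min])
    rng.shuffle(idx)
    return idx

rng = np.random.default_rng(1)
a = rng.integers(0, 2, size=2000)         # hypothetical non-causal attribute
y = rng.integers(0, 2, size=2000)         # binary labels, independent of a
idx = induce_spurious_correlation(a, y)
frac = float((a[idx] == y[idx]).mean())   # share of the majority group after subsampling
```

After subsampling, the attribute predicts the label for roughly nine in ten samples even though the two were generated independently, which is the kind of shortcut the injection is meant to create.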

Repair Tools. To repair the identified errors, we employ specific repair strategies tailored to each error type. (1) For LEs, the repair mechanism depends on the task: for tabular data classification and recommendation systems, following (Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19); Kuan and Mueller, [2022b](https://arxiv.org/html/2606.11616#bib.bib21)), we correct noisy labels by replacing them with the class indices maximizing the predicted probabilities derived from $k$-fold cross-validation; for regression tasks, we discard the detected LE samples to prevent them from skewing the fitted model; for LLM alignment tasks, we switch the “chosen” and “rejected” responses. (2) For FEs, following (Ding et al., [2025](https://arxiv.org/html/2606.11616#bib.bib11); Kuan and Mueller, [2022a](https://arxiv.org/html/2606.11616#bib.bib20)), we adopt a statistical imputation approach based on Z-scores for tabular data prediction and recommendation systems. We first compute the mean and standard deviation of features using only clean samples (per-class for classification, global for regression). Feature values in erroneous samples with Z-scores exceeding a threshold are deemed outliers and replaced by the corresponding mean value. For LLM alignment, we discard samples with FEs. (3) Finally, for SCs, rather than modifying feature values, we implement a sample reweighting strategy following (Ye et al., [2025](https://arxiv.org/html/2606.11616#bib.bib39)). To mitigate the model’s reliance on majority shortcuts, we assign a higher importance weight to samples from minority groups, thereby forcing the model to focus more on these underrepresented instances during training.
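The Z-score imputation for FEs can be sketched as follows. This is a minimal illustration of the global-statistics variant described for regression, assuming a threshold of 3.0 and the hypothetical function name `zscore_repair`; the per-class variant would compute `mu` and `sigma` within each class instead.

```python
import numpy as np

def zscore_repair(X, is_clean, flagged, threshold=3.0):
    """Replace outlying feature values in flagged rows with the clean-sample mean."""
    X = X.copy()
    mu = X[is_clean].mean(axis=0)              # statistics from clean samples only
    sigma = X[is_clean].std(axis=0) + 1e-12    # small constant avoids division by zero
    z = np.abs((X - mu) / sigma)
    outlier = (z > threshold) & flagged[:, None]   # only touch rows flagged as FE
    X[outlier] = np.broadcast_to(mu, X.shape)[outlier]
    return X

X = np.array([[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [2.0, 500.0]])
is_clean = np.array([True, True, True, False])
flagged = ~is_clean
X_rep = zscore_repair(X, is_clean, flagged)    # last row becomes [2.0, 11.0]
```

Only the cell whose Z-score exceeds the threshold is overwritten, so valid feature values in an erroneous sample are preserved.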

Evaluation Metrics. We evaluate the effectiveness of data debugging methods along two key dimensions: (1) Debugging F1: We use the average F1-score to assess how precisely a method identifies specific error types within mixed-error data. We first calculate the F1-score $F1_t$ for each error type $t\in\mathcal{T}$, and quantify the overall debugging performance by the average of $F1_t$ across all error types; (2) Repair efficacy: We report the task model’s performance (Accuracy and MSE for tabular data prediction; AUC for recommendation; and WinRate against GPT4 for LLM alignment) on the test set after training on the repaired training data.
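The debugging F1 metric amounts to a macro-average over the three error types. A minimal sketch with made-up predictions is shown below; the function name and the convention of returning 0 for an error type with no positives are assumptions.

```python
import numpy as np

def debugging_f1(t_true, t_pred):
    """Per-type F1 over multi-hot labels, and their unweighted average."""
    f1s = []
    for k in range(t_true.shape[1]):
        tp = np.sum((t_true[:, k] == 1) & (t_pred[:, k] == 1))
        fp = np.sum((t_true[:, k] == 0) & (t_pred[:, k] == 1))
        fn = np.sum((t_true[:, k] == 1) & (t_pred[:, k] == 0))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s)), f1s

# Columns: LE, FE, SC. The second sample's FE is missed by the prediction.
t_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
t_pred = np.array([[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]])
avg_f1, per_type = debugging_f1(t_true, t_pred)   # per-type: 1.0, 2/3, 1.0
```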

Baselines. (1) Clean Data: The unperturbed raw data; (2) Error. Data: The data with mixed data error types through controlled error injection; (3) Debugging with Data Cleaning (DDC): Cleanlab (Kuan and Mueller, [2022b](https://arxiv.org/html/2606.11616#bib.bib21), [a](https://arxiv.org/html/2606.11616#bib.bib20)) provides state-of-the-art detection tools for label issues, outliers, and underperforming groups (https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html). Following (Siddiqi et al., [2023](https://arxiv.org/html/2606.11616#bib.bib30)), we sequentially organize these cleaning tools into a pipeline, DDC, for debugging data with mixed error types; (4) Debugging with Data Attribution (DDA): Following state-of-the-art data attribution methods (Jiang et al., [2023](https://arxiv.org/html/2606.11616#bib.bib16); Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19)), we construct two baselines: DDA-select, which computes scalar influence scores for training samples by aggregating influence function values across validation data and discards those with low scores (Jiang et al., [2023](https://arxiv.org/html/2606.11616#bib.bib16); Xia et al., [2024](https://arxiv.org/html/2606.11616#bib.bib38)); and DDA-repair, which first identifies detrimental samples with influence scores, then clusters their influence vectors to infer error types based on heuristics (Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19); Myrtakis et al., [2025](https://arxiv.org/html/2606.11616#bib.bib26); Zheng et al., [2024](https://arxiv.org/html/2606.11616#bib.bib45); Wang et al., [2023](https://arxiv.org/html/2606.11616#bib.bib34)), enabling type-specific repair. Note that debugging F1 can only be computed for DDA-repair, as DDA-select performs binary selection without identifying error types. (5) DeMix: Our framework uses a DEC for error type classification, followed by type-specific repair. In DeMix, we train a specific DEC for each dataset and use it for debugging. We further propose DeMix-unif, a variant where a unified DEC is trained with a diverse dataset collection and directly deployed to target datasets for debugging. More details about the workflow of baselines are provided in Appendix [D](https://arxiv.org/html/2606.11616#A4).

Implementation Details. (1) We implement the DEC using a Set Transformer with 2 SAB layers and 1 PMA layer, followed by 3 MLP-based task heads. For DeMix, we set $|\mathcal{E}|=50{,}000$ per dataset; for DeMix-unif, the unified DEC is trained on a composite dataset with $|\mathcal{E}|=20{,}000$ samples drawn from each source dataset. For interventions on the validation set, we set $V=10$; for interventions on the task model, we set $K=3$ for tabular data prediction and recommendation systems, and $K=1$ for LLM alignment. Regarding loss coefficients, we set $\lambda_1=0.1$ and $\lambda_2=0.1$. (2) As for data repair tools, we apply targeted strategies for each type of data error: LEs are repaired via probability-based relabeling (Kong et al., [2021](https://arxiv.org/html/2606.11616#bib.bib19)); FEs are repaired by locating and correcting perturbed feature columns (Ding et al., [2025](https://arxiv.org/html/2606.11616#bib.bib11)); and SCs are mitigated by up-weighting identified minority samples during retraining (Ye et al., [2025](https://arxiv.org/html/2606.11616#bib.bib39); Liu et al., [2021](https://arxiv.org/html/2606.11616#bib.bib24)). Details about repair tools are available in Appendix [4.1](https://arxiv.org/html/2606.11616#S4.SS1). We keep the repair strategies the same for different debugging baselines. (3) All experiments were conducted on a server equipped with a Montage Jintide® C6226R CPU, 256GB RAM, and four NVIDIA GeForce RTX 4090 GPUs.

![Refer to caption](https://arxiv.org/html/2606.11616v1/x4.png) Figure 4. Debugging F1-score (%) on 11 datasets across 5 independent runs under varying clean data ratios $\alpha$.

![Refer to caption](https://arxiv.org/html/2606.11616v1/x5.png) Figure 5. Debugging F1-score (%, $\alpha=0.5$) of DeMix for specific error types.
### 4\.2\.Debugging Performance

To answer RQ1, we first analyze the average debugging F1 across all error types and for each specific error type, followed by a granular analysis of hard cases in which a single training sample contains multiple error types.

Overall Performance. As illustrated in Fig. [4](https://arxiv.org/html/2606.11616#S4.F4), DeMix shows superior debugging F1 compared to state-of-the-art baselines. We summarize the key observations as follows: (1) Effectiveness of DeMix: DeMix consistently outperforms existing baselines across nearly all datasets and values of $\alpha$. Quantitatively, it achieves an average improvement of 22.61% in F1-score. This substantial margin validates the capability of DeMix for accurate data debugging, irrespective of data formats (tabular, recommendation, or text) or error ratios. (2) Robustness of DeMix: A distinguishing feature of DeMix is its stability against variations in error ratios. As observed in Fig. [4](https://arxiv.org/html/2606.11616#S4.F4), while baseline performance fluctuates significantly with changes in $\alpha$ (e.g., DDA-repair on Adult and Bank), DeMix maintains a consistently high performance curve with minimal variance. This stability suggests that DeMix successfully captures the intrinsic patterns of data errors rather than overfitting to distributional statistics that shift with error ratios, thereby ensuring reliable data debugging in dynamic real-world scenarios. (3) Type-specific performance: Figure [5](https://arxiv.org/html/2606.11616#S4.F5) details the debugging performance of DeMix across specific error types. Overall, DeMix excels at detecting SCs and LEs but shows lower sensitivity to FEs, as they are inherently stealthy and induce weaker influence signals. This limitation is notably observed in LLM alignment tasks, where debugging F1 gains are modest. We believe this results from the approximation error of influence function estimation in large-scale models such as Llama.

Table 1. Debugging F1-score (%, $\alpha=0.5$) on hard cases where a single sample contains multiple error types.

Analysis on Hard Cases. To further evaluate the robustness of DeMix, we analyze the performance on a subset of hard cases where a single sample contains multiple co-occurring error types, i.e., for a sample $z_i$, $\sum_{k=1}^{3}\hat{\rm t}_i^{(k)}\geq 2$. As shown in Table [1](https://arxiv.org/html/2606.11616#S4.T1), DeMix demonstrates superior capability in handling these hard cases. It achieves the best performance on 5 out of 6 datasets and provides an average improvement of 21.21% in debugging F1-score over baselines. While baselines like DDA-repair rely on scalar influence scores that often produce conflicting signals when errors overlap, DeMix treats debugging as a multi-label classification task. By leveraging the high-dimensional semantic information within influence vectors, our method can effectively disentangle mixed error patterns.

Table 2. Debugging F1-score (%) on datasets without error injection.

Practical Validity. The practical effectiveness of DeMix when deployed in real-world applications is twofold: (1) Detecting naturally occurring errors. As shown in Table [2](https://arxiv.org/html/2606.11616#S4.T2), we evaluated DeMix on datasets with no synthetic error injection. The debugging F1 scores show that DeMix consistently outperforms the strongest baseline (DDA-repair) in identifying naturally occurring errors. As a case study in CelebA (with SC between gender and hair color), DeMix achieves high F1 in detecting minority samples (e.g., blond males), highlighting its practical diagnostic capability. (2) Robustness to noisy base data. In practice, the DEC requires only a small amount of clean base data for error injection (e.g., 10K samples on the Amazon dataset yield >83% debugging F1-score), which can be directly sourced from the clean validation set. Critically, even when the base data itself is noisy (30% pre-existing random errors), DeMix’s debugging F1 drops by only 2.15%, demonstrating robustness to realistic deployment conditions where perfectly clean base data may not be available.

Table 3. Performance comparison of data repair across 11 datasets from 3 domains with $\alpha=0.5$. We report the mean and standard deviation of performance (%) over 5 independent runs. The best results are in bold.
### 4\.3\.Repair Efficacy

To answer RQ2, we conduct type-specific data repair based on the classification results from DEC, retrain the task model on the repaired data, and observe its test performance. Table [3](https://arxiv.org/html/2606.11616#S4.T3) presents a comprehensive performance evaluation of DeMix across 11 datasets covering 3 distinct domains. In these experiments, we fix $\alpha=0.5$. For task models, we adopt MLP1 for tabular prediction, DIN for recommendation, and Qwen2.5-0.5B-Instruct for LLM alignment. The experimental results lead to the following key observations: (1) Impact of mixed error types: The introduction of mixed error types significantly degrades the performance of task models. As evidenced by the results of the Error. Data without any data debugging (second row), models trained on such contaminated data exhibit substantial performance degradation (19.60% on average) compared to the Clean Data baseline. This considerable gap underscores the critical need for effective training data debugging methods in practical ML pipelines. (2) Effectiveness of DeMix: Our proposed DeMix consistently outperforms other data debugging baselines. On average, DeMix surpasses all data debugging baselines by 9.32%, demonstrating that our framework can perform effective debugging for type-specific repair. (3) Transferability of DeMix-unif: While DeMix generally achieves the highest performance due to its task-specific specialization, DeMix-unif remains highly competitive. Notably, DeMix-unif attains the best performance on the Air Quality and Amazon datasets. This indicates that our DEC is capable of capturing transferable data error patterns across datasets without a significant performance drop. (4) Importance of data repair: Results of DeMix w/o. repair and DeMix-unif w/o. repair (the 7th and 9th rows of Table [3](https://arxiv.org/html/2606.11616#S4.T3)) reveal that simply discarding erroneous samples is suboptimal. The full DeMix framework, which incorporates data repair, consistently outperforms the removal-only variant. This confirms that data repair is crucial for recovering valuable information from erroneous samples. (5) Limitations of baselines: We observe that DDA-repair underperforms DDA-select on most datasets. Correlating this observation with Section [4.2](https://arxiv.org/html/2606.11616#S4.SS2), we attribute this deficiency to the limited debugging F1 of DDA on data with mixed error types. Data repair based on inaccurate debugging results introduces additional noise, thereby undermining model training. Furthermore, DDC provides only marginal improvements, indicating that data cleaning is inadequate for handling data with complex systematic errors.

### 4\.4\.Ablation Studies

To answer RQ3, we conducted ablation studies to evaluate the marginal contributions of $\mathcal{L}_V$ and $\mathcal{L}_M$. For tabular data classification tasks, we utilize three task models (MLP1, MLP2, and TabPFN) and two disjoint validation sets, $\mathcal{D}_v^{1}$ and $\mathcal{D}_v^{2}$. We constructed three configurations for influence vector computation: ${\rm cf}_1$ ($\mathcal{D}_v^{1}$ and MLP1&MLP2), ${\rm cf}_2$ ($\mathcal{D}_v^{2}$ and MLP1&MLP2), and ${\rm cf}_3$ ($\mathcal{D}_v^{1}$ and TabPFN). The DEC is trained on influence vectors computed on ${\rm cf}_1$ and evaluated on ${\rm cf}_2$ and ${\rm cf}_3$, to simulate real-world scenarios in which validation sets and model architectures evolve. Table [4](https://arxiv.org/html/2606.11616#S4.T4) presents the results. First, removing $\mathcal{L}_V$ causes marked performance degradation on ${\rm cf}_2$, with F1 scores dropping by an average of 3.5% across datasets. This highlights the necessity of $\mathcal{L}_V$ in preventing DEC from overfitting to specific validation sets. Second, excluding $\mathcal{L}_M$ also impairs generalization to the unseen TabPFN in ${\rm cf}_3$, evidenced by a substantial drop of 3.0% on the Credit dataset. This confirms that $\mathcal{L}_M$ effectively aligns influence representations across diverse model structures. Finally, the removal of both terms (w/o $\mathcal{L}_V,\mathcal{L}_M$) yields the lowest performance, demonstrating that our proposed invariant learning framework is critical for generalizable training data debugging.

Table 4. Ablation studies (debugging F1, %) of DeMix under different configurations (${\rm cf}_1$, ${\rm cf}_2$, and ${\rm cf}_3$) for influence vector computation. The DEC of DeMix is trained on ${\rm cf}_1$ and evaluated on ${\rm cf}_2$ and ${\rm cf}_3$.
### 4\.5\.Deployment Costs

Table [5](https://arxiv.org/html/2606.11616#S5.T5) reports the deployment costs of DeMix on the MovieLens dataset (>1.3M samples). DeMix achieves comparable inference efficiency to DDA-repair with substantially higher debugging F1. Both methods require computing the influence values of training samples on validation samples; however, DDA-repair averages these values into a scalar, while DeMix retains them as a vector. DeMix’s additional overhead relative to DDA-repair is the DEC inference pass ($\sim$3 min), which is negligible compared to the overall pipeline cost. For large-scale datasets, DeMix scales efficiently via validation sampling: using only 10% of validation samples for influence computation reduces both runtime (from 35.4 to 6.6 min) and peak GPU memory (from 22.6 to 9.2 GB), while incurring a negligible drop in debugging F1 ($\sim$0.2%).

## 5. Conclusion

In this paper, we introduce DeMix, a novel framework that leverages influence vectors for debugging training data with mixed error types\. Our key insight is that different error types induce distinct patterns in model predictions across the validation set, which are effectively captured by influence vectors\. Building on this insight, we formulate training data debugging as a multi\-label classification problem that takes influence vectors as input features\. We design a classifier to predict error types and introduce an intervention\-based training strategy that ensures the classifier captures invariant and error\-specific rationales, improving its generalization across different training setups\. Extensive experiments are conducted on 11 tasks spanning tabular data prediction, recommendation systems, and LLM alignment\. The results demonstrate that DeMix significantly outperforms state\-of\-the\-art baselines, achieving substantial improvements in debugging F1 score and downstream task model performance after targeted repair\.

Table 5. Deployment costs on MovieLens on a server with an NVIDIA RTX 4090 GPU.

###### Acknowledgements

This work is supported by the National Key Research and Development Program of China \(No\. 2024YFB4505203\), National Natural Science Foundation of China \(No\. 62522211\), and Key Research and Development Program of Xinjiang Uygur Autonomous Region \(Grant No\. 2023B01027, 2023B01027\-1\)\.

## References

- Alemi et al. (2017) Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. 2017. Deep Variational Information Bottleneck. In *International Conference on Learning Representations*.
- Bao et al. (2024) Xianchun Bao, Zian Bao, Bie Binbin, QingSong Duan, Wenfei Fan, Hui Lei, Daji Li, Wei Lin, Peng Liu, Zhicong Lv, et al. 2024. Rock: Cleaning Data by Embedding ML in Logic Rules. In *Companion of the 2024 International Conference on Management of Data*. 106–119.
- Chen et al. (2024a) Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. 2024a. ODIN: Disentangled Reward Mitigates Hacking in RLHF. In *International Conference on Machine Learning*. PMLR, 7935–7952.
- Chen et al. (2024b) Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. 2024b. ODIN: Disentangled Reward Mitigates Hacking in RLHF. In *International Conference on Machine Learning*. PMLR, 7935–7952.
- Choe et al. (2024) Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, et al. 2024. What is your data worth to gpt? llm-scale data valuation with influence functions. *arXiv preprint arXiv:2405.13954* (2024).
- Chu et al. (2016) Xu Chu, Ihab F Ilyas, Sanjay Krishnan, and Jiannan Wang. 2016. Data cleaning: Overview and emerging challenges. In *Proceedings of the 2016 international conference on management of data*. 2201–2206.
- Deng et al. (2025) Junwei Deng, Yuzheng Hu, Pingbang Hu, Ting-Wei Li, Shixuan Liu, Jiachen T Wang, Dan Ley, Qirun Dai, Benhao Huang, Jin Huang, et al. 2025. A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI. (2025).
- Deng et al. (\[n. d.\]) Jiale Deng, Yanyan Shen, Ziyuan Pei, Youmin Chen, and Linpeng Huang. \[n. d.\]. Influence Guided Context Selection for Effective Retrieval-Augmented Generation. In *The Thirty-ninth Annual Conference on Neural Information Processing Systems*.
- Deng et al. (2024) Yuhao Deng, Chengliang Chai, Lei Cao, Nan Tang, Jiayi Wang, Ju Fan, Ye Yuan, and Guoren Wang. 2024. MisDetect: Iterative Mislabel Detection using Early Loss. *Proceedings of the VLDB Endowment* 17, 6 (2024), 1159–1172.
- Ding et al. (2025) Xiaoou Ding, Zekai Qian, Hongzhi Wang, Siying Chen, Yafeng Tang, Hongbin Su, Huan Hu, and Chen Wang. 2025. UniClean: A Scalable Data Cleaning Solution for Mixed Errors based on Unified Cleaners and Optimized Cleaning Workflow. *Proceedings of the VLDB Endowment* 18, 11 (2025), 4117–4130.
- Gao et al. (2026) Xinyi Gao, Dongting Xie, Yihang Zhang, Zhengren Wang, Chong Chen, Conghui He, Hongzhi Yin, and Wentao Zhang. 2026. A comprehensive survey on imbalanced data learning. *Frontiers of Computer Science* 20, 11 (2026), 2011622.
- Hammoudeh and Lowd (2024) Zayd Hammoudeh and Daniel Lowd. 2024. Training data influence analysis and estimation: A survey. *Machine Learning* 113, 5 (2024), 2351–2403.
- Hollmann et al. (\[n. d.\]) Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. \[n. d.\]. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In *The Eleventh International Conference on Learning Representations*.
- Hu et al. (2025) Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, and Jiaqi W Ma. 2025. GraSS: Scalable Influence Function with Sparse Gradient Compression. *arXiv preprint arXiv:2505.18976* (2025).
- Jiang et al. (2023) Kevin Jiang, Weixin Liang, James Y Zou, and Yongchan Kwon. 2023. Opendataval: a unified benchmark for data valuation. *Advances in Neural Information Processing Systems* 36 (2023), 28624–28647.
- Kersbergen et al. (2025) Barrie Kersbergen, Olivier Sprangers, Bojan Karlaš, Maarten de Rijke, and Sebastian Schelter. 2025. Scalable Data Debugging for Neighborhood-based Recommendation with Data Shapley Values. In *Proceedings of the Nineteenth ACM Conference on Recommender Systems*. 441–450.
- Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In *International conference on machine learning*. PMLR, 1885–1894.
- Kong et al. (2021) Shuming Kong, Yanyan Shen, and Linpeng Huang. 2021. Resolving training biases via influence-based data relabeling. In *International Conference on Learning Representations*.
- Kuan and Mueller (2022a) Johnson Kuan and Jonas Mueller. 2022a. Back to the Basics: Revisiting Out-of-Distribution Detection Baselines. In *ICML Workshop on Principles of Distribution Shift*.
- Kuan and Mueller (2022b) Johnson Kuan and Jonas Mueller. 2022b. Model-agnostic label quality scoring to detect real-world label errors. In *ICML DataPerf Workshop*.
- Lee et al. (2019) Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set transformer: A framework for attention-based permutation-invariant neural networks. In *International conference on machine learning*. PMLR, 3744–3753.
- Liang et al. (2022) Weixin Liang, Girmaw Abebe Tadesse, Daniel Ho, Li Fei-Fei, Matei Zaharia, Ce Zhang, and James Zou. 2022. Advances, challenges and opportunities in creating data for trustworthy AI. *Nature Machine Intelligence* 4, 8 (2022), 669–677.
- Liu et al. (2021) Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. 2021. Just train twice: Improving group robustness without training group information. In *International Conference on Machine Learning*. PMLR, 6781–6792.
- Miao et al. (2022) Siqi Miao, Mia Liu, and Pan Li. 2022. Interpretable and generalizable graph learning via stochastic attention mechanism. In *International conference on machine learning*. PMLR, 15524–15543.
- Myrtakis et al. (2025) Nikolaos Myrtakis, Ioannis Tsamardinos, and Vassilis Christophides. 2025. Data Glitches Discovery using Influence-based Model Explanations. In *Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1*. 1068–1079.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. *arXiv preprint arXiv:1807.03748* (2018).
- Peng et al. (\[n. d.\]) Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, and Junbo Zhao. \[n. d.\]. DataMan: Data Manager for Pre-training Large Language Models. In *The Thirteenth International Conference on Learning Representations*.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. *Advances in neural information processing systems* 36 (2023), 53728–53741.
- Siddiqi et al. (2023) Shafaq Siddiqi, Roman Kern, and Matthias Boehm. 2023. SAGA: A scalable framework for optimizing data cleaning pipelines for machine learning applications. *Proceedings of the ACM on Management of Data* 1, 3 (2023), 1–26.
- Team (2024) Qwen Team. 2024. Qwen2.5: A Party of Foundation Models. [https://qwenlm.github.io/blog/qwen2.5/](https://qwenlm.github.io/blog/qwen2.5/)
- Tishby et al. (2000) Naftali Tishby, Fernando C Pereira, and William Bialek. 2000. The information bottleneck method. *arXiv preprint physics/0004057* (2000).
- Tong et al. (\[n. d.\]) Yunze Tong, Fengda Zhang, Zihao Tang, Kaifeng Gao, Kai Huang, Pengfei Lyu, Jun Xiao, and Kun Kuang. \[n. d.\]. Latent Score-Based Reweighting for Robust Classification on Imbalanced Tabular Data. In *Forty-second International Conference on Machine Learning*.
- Wang et al. (2023) Fulton Wang, Julius Adebayo, Sarah Tan, Diego Garcia-Olano, and Narine Kokhlikyan. 2023. Error discovery by clustering influence embeddings. *Advances in Neural Information Processing Systems* 36 (2023), 41765–41777.
- Weng et al. (2026) Shihao Weng, Yang Feng, Yining Yin, Zhenlun Zhang, and Baowen Xu. 2026. Data preparation and quality for code-centric generative software engineering tasks: a systematic literature review. *Frontiers of Computer Science* 20, 9 (2026), 2009203.
- Wu et al. (2023) Shirley Wu, Mert Yuksekgonul, Linjun Zhang, and James Zou. 2023. Discover and cure: Concept-aware mitigation of spurious correlation. In *International Conference on Machine Learning*. PMLR, 37765–37786.
- Wu et al. (2022) Ying-Xin Wu, Xiang Wang, An Zhang, Xiangnan He, and Tat seng Chua. 2022. Discovering Invariant Rationales for Graph Neural Networks. In *ICLR*.
- Xia et al. (2024) Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, and Danqi Chen. 2024. LESS: Selecting Influential Data for Targeted Instruction Tuning. In *International Conference on Machine Learning*. PMLR, 54104–54132.
- Ye et al. (2025) Wenqian Ye, Guangtao Zheng, and Aidong Zhang. 2025. Improving group robustness on spurious correlation via evidential alignment. In *Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2*. 3610–3621.
- Yin et al. (2024) Mingjia Yin, Hao Wang, Wei Guo, Yong Liu, Suojuan Zhang, Sirui Zhao, Defu Lian, and Enhong Chen. 2024. Dataset regeneration for sequential recommendation. In *Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining*. 3954–3965.
- Zhang et al. (2022) Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, et al. 2022. DHEN: A deep and hierarchical ensemble network for large-scale click-through rate prediction. *arXiv preprint arXiv:2203.11014* (2022).
- Zhang et al. (2025a) Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, and Tong Zhang. 2025a. From lists to emojis: How format bias affects model alignment. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*. 26940–26961.
- Zhang et al. (2025b) Yansen Zhang, Xiaokun Zhang, Ziqiang Cui, and Chen Ma. 2025b. Shapley value-driven data pruning for recommender systems. In *Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2*. 3879–3888.
- Zhao et al. (2026) Weixiang Zhao, Yulin Hu, Xingyu Sui, Zhuojun Li, Yang Deng, Yanyan Zhao, Bing Qin, and Wanxiang Che. 2026. The gains do not make up for the losses: a comprehensive evaluation for safety alignment of large language models via machine unlearning. *Frontiers of Computer Science* 20, 2 (2026), 2002319.
- Zheng et al. (2024) Kaiping Zheng, Horng-Ruey Chua, Melanie Herschel, HV Jagadish, Beng Chin Ooi, and James Wei Luen Yip. 2024. Exploiting negative samples: a catalyst for cohort discovery in healthcare analytics. In *Forty-first International Conference on Machine Learning*.
- Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In *Proceedings of the AAAI conference on artificial intelligence*, Vol. 33. 5941–5948.
- Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In *Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining*. 1059–1068.

## Appendix

![Refer to caption](https://arxiv.org/html/2606.11616v1/x6.png)

Figure 6. t-SNE visualization of influence vectors of erroneous samples, colored by error types.

## Appendix A Visualizations of Influence Vectors

Figure [6](https://arxiv.org/html/2606.11616#Ax1.F6) shows t-SNE visualizations of influence vectors across diverse tasks. Samples with the same error type consistently form cohesive clusters, while different error types are well separated, demonstrating the effectiveness of influence vectors in characterizing error patterns across tabular, recommendation, and LLM alignment tasks. For LLM alignment, the separation is less distinct, with FEs often overlapping with other error types. This is because LLMs are relatively insensitive to token-level perturbations; in their massive parameter space, such subtle text-noise signals are easily diluted, causing FEs to entangle with other error categories in the projected space.

## Appendix B Case Study

Figure [7](https://arxiv.org/html/2606.11616#A2.F7) visualizes the influence vectors of six training samples containing different error types across 20 validation instances from the Adult dataset. The heatmap reveals that distinct error types manifest as unique patterns in the influence vectors. Specifically, LEs exhibit consistently negative influence (indicated by blue regions in rows 1 and 2), reflecting a more detrimental effect on the prediction of similar validation samples (e.g., sample 5380 and sample 1391 are similar). This suggests that LEs are particularly harmful, likely inducing significant shifts in the decision boundary. Conversely, influence values of FEs remain close to zero (rows 3 and 4), implying a negligible impact due to the stochastic nature of random feature corruption. Notably, SCs display sharp positive spikes on specific validation samples sharing the same minority group. For instance, training samples 7191 and 9905 strongly support validation sample 557 (dark red), as they share the minority attribute combination (Female, >50K). These distinguishable patterns empirically validate that influence vectors effectively encode error-specific semantics, a property our framework leverages for accurate data debugging.

![Refer to caption](https://arxiv.org/html/2606.11616v1/x7.png)

Figure 7. A case study on the Adult dataset, where blue and red regions indicate negative and positive influence values, respectively.

![Refer to caption](https://arxiv.org/html/2606.11616v1/x8.png)

Figure 8. Overall workflow of baselines.

## Appendix C Theoretical Analysis

The training objective of DEC is grounded in the Information Bottleneck (IB) principle (Tishby et al., [2000](https://arxiv.org/html/2606.11616#bib.bib32); Alemi et al., [2017](https://arxiv.org/html/2606.11616#bib.bib2); Miao et al., [2022](https://arxiv.org/html/2606.11616#bib.bib25)). Let $\Phi$ denote influence vectors, $E$ denote the error-type variable, and $H=q_{\psi}(\Phi)$ denote the representation produced by the Set Transformer encoder. Since $\Phi$ inevitably contains configuration-specific noise introduced by the validation set and task model used for influence computation, DEC aims to learn a minimal sufficient representation:

$$
H^{*}=\arg\max_{H}\left[I(H;E)-\beta\,I(H;\Phi)\right], \tag{10}
$$

where $I(\cdot;\cdot)$ denotes mutual information. The first term ensures that $H$ preserves error-specific semantics from $\Phi$, while the second term compresses away irrelevant configuration noise. Below we derive tractable surrogates for each term.

#### Maximizing $I(H;E)$.

The true posterior $p(E|H)$ is intractable, so we introduce the DEC prediction head $q_{\omega}(E|H)$ as a variational approximation (Alemi et al., [2017](https://arxiv.org/html/2606.11616#bib.bib2); Miao et al., [2022](https://arxiv.org/html/2606.11616#bib.bib25)):

$$
\begin{aligned}
I(H;E) &= \mathbb{E}_{H,E}\left[\log\frac{p(E|H)}{p(E)}\right] \\
&= \mathbb{E}_{H,E}\left[\log q_{\omega}(E|H)\right]-\mathbb{E}_{E}\left[\log p(E)\right]+\mathbb{E}_{H}\left[{\rm KL}\left(p(E|H)\,\|\,q_{\omega}(E|H)\right)\right] \\
&\geq \mathbb{E}_{H,E}\left[\log q_{\omega}(E|H)\right]-\mathbb{E}_{E}\left[\log p(E)\right].
\end{aligned} \tag{11}
$$

The inequality holds because the KL divergence is non-negative. Since $-\mathbb{E}_{E}[\log p(E)]$ is constant during training, maximizing this lower bound reduces to maximizing $\mathbb{E}[\log q_{\omega}(E|H)]$. Because each error type is modeled as an independent Bernoulli variable, this is equivalent to minimizing the element-wise BCE loss $\mathcal{L}_{\rm pred}$.
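To make the last step concrete, the sketch below checks numerically that the negative log-likelihood of independent Bernoulli error-type labels equals the element-wise BCE summed over error types. The three error types, the predicted probabilities, and the function names are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def bernoulli_log_likelihood(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """log q(E|H) when each error type is an independent Bernoulli variable."""
    per_type = np.where(labels == 1, probs, 1.0 - probs)
    return np.log(per_type).sum(axis=-1)

def bce(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Element-wise binary cross-entropy, summed over error types."""
    return -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).sum(axis=-1)

# Two samples, three hypothetical error-type outputs each.
probs = np.array([[0.9, 0.2, 0.6],
                  [0.1, 0.7, 0.3]])
labels = np.array([[1, 0, 1],
                   [0, 1, 0]])

log_lik = bernoulli_log_likelihood(probs, labels)
loss = bce(probs, labels)
print(np.allclose(loss, -log_lik))
```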

#### Minimizing $I(H;\Phi)$.

The compression term $I(H;\Phi)$ aims to discard the information in $\Phi$ that is irrelevant to the error semantics $E$. In our setting, the dominant source of such nuisance information is the configuration $C=(V,M)$ used to compute influence vectors, where $V$ denotes the validation subset and $M$ denotes the task model. Changing either $V$ or $M$ alters the numerical scale, sparsity, and local patterns of $\Phi$, even when the underlying training sample and its error type remain unchanged. Therefore, after $\mathcal{L}_{\rm pred}$ preserves the error-specific semantics in $H$, we focus on suppressing the conditional configuration information $I(H;C|E)$ as the configuration-related component of the compression objective. By the chain rule of mutual information,

$$
I(H;E,C)=I(H;E)+I(H;C|E), \tag{12}
$$

so minimizing the configuration-specific part amounts to minimizing $I(H;C|E)$. This conditional mutual information admits the exact KL form:

$$
I(H;C|E)=\mathbb{E}_{e,c}\left[{\rm KL}\left(p(H|E=e,C=c)\,\|\,p(H|E=e)\right)\right]. \tag{13}
$$

$I(H;C|E)=0$ if and only if $p(H|E=e,C=c)=p(H|E=e)$ for all configurations $c$, i.e., the representation is invariant to configurations within each error type. Since $C=(V,M)$, we decompose:

$$
I(H;C|E)=I(H;V|E)+I(H;M|E,V). \tag{14}
$$
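Eqs. (12) and (13) can be sanity-checked on a small discrete example. The sketch below draws a random joint distribution over $(H, E, C)$, computes $I(H;E,C)$ and $I(H;E)$ directly, computes $I(H;C|E)$ through the KL form, and confirms that the chain rule holds. The table sizes and variable names are arbitrary assumptions; the representation in DeMix is continuous, so this only illustrates the identities.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((4, 3, 2))  # joint table p(h, e, c); axes: h, e, c
p /= p.sum()

p_h = p.sum(axis=(1, 2))   # p(h)
p_ec = p.sum(axis=0)       # p(e, c)
p_he = p.sum(axis=2)       # p(h, e)
p_e = p.sum(axis=(0, 2))   # p(e)

def mutual_information(joint: np.ndarray, marg_a: np.ndarray, marg_b: np.ndarray) -> float:
    """I(A;B) for a joint table with A on axis 0 and B on the remaining axes."""
    outer = marg_a.reshape(-1, *([1] * marg_b.ndim)) * marg_b
    return float((joint * np.log(joint / outer)).sum())

i_h_ec = mutual_information(p, p_h, p_ec)   # I(H; E, C)
i_h_e = mutual_information(p_he, p_h, p_e)  # I(H; E)

# Eq. (13): expected KL between p(h | e, c) and p(h | e) under p(e, c).
p_h_given_ec = p / p_ec
p_h_given_e = p_he / p_e
kl = (p_h_given_ec * np.log(p_h_given_ec / p_h_given_e[:, :, None])).sum(axis=0)
i_h_c_given_e = float((p_ec * kl).sum())

print(abs(i_h_ec - (i_h_e + i_h_c_given_e)) < 1e-10)
```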

#### Connection to $\mathcal{L}_{V}$ and $\mathcal{L}_{M}$.

The validation-invariant loss $\mathcal{L}_{V}$ minimizes an empirical surrogate of $I(H;V|E)$ by pulling together representations of the same sample computed from different validation subsets, while its contrastive negatives avoid representation collapse across different error types. The model-invariant loss $\mathcal{L}_{M}$ minimizes an empirical surrogate of $I(H;M|E,V)$ by directly aligning representations of the same sample computed from different task models. Under the common local Gaussian approximation $p(H|E=e,C=c)=\mathcal{N}(\mu_{e,c},\sigma^{2}I)$, these KL terms are proportional to squared distances between configuration-specific representation means, which justifies the pairwise alignment forms used in $\mathcal{L}_{V}$ and $\mathcal{L}_{M}$. Therefore, DEC optimizes the IB objective by using $\mathcal{L}_{\rm pred}$ to preserve $I(H;E)$ and using $\mathcal{L}_{V},\mathcal{L}_{M}$ to suppress the configuration component of $I(H;\Phi)$.
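The identity behind the Gaussian argument is that, for two Gaussians with the same isotropic covariance $\sigma^{2}I$, the KL divergence equals $\|\mu_{1}-\mu_{2}\|^{2}/(2\sigma^{2})$. The sketch below verifies this against the general Gaussian KL formula. The means, $\sigma$, and function name are arbitrary assumptions. The marginal $p(H|E=e)$ in Eq. (13) is a mixture over configurations, so treating it as a single Gaussian is a further simplification.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q) -> float:
    """KL(N(mu_p, cov_p) || N(mu_q, cov_q)) for full covariance matrices."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return float(0.5 * (np.trace(cov_q_inv @ cov_p)
                        + diff @ cov_q_inv @ diff
                        - d + logdet_q - logdet_p))

sigma = 0.7
mu_a = np.array([0.5, -1.0, 2.0])  # hypothetical mean under configuration a
mu_b = np.array([1.5, 0.0, 1.0])   # hypothetical mean under configuration b
cov = sigma**2 * np.eye(3)

kl_general = gaussian_kl(mu_a, cov, mu_b, cov)
kl_closed_form = float(np.sum((mu_a - mu_b) ** 2) / (2 * sigma**2))
print(np.isclose(kl_general, kl_closed_form))
```

With equal covariances the trace and log-determinant terms cancel, leaving only the squared distance between the means.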

## Appendix D Baseline Workflows

As illustrated in Figure [8](https://arxiv.org/html/2606.11616#A2.F8), we compare DeMix against three representative baselines. (a) DDC treats erroneous data as anomalies and applies type-specific repair based on detection results. (b) DDA-select aggregates influence values over validation data into scalar scores and discards samples below a threshold $\eta$. (c) DDA-repair extends DDA-select with a heuristic repair pipeline: it selects detrimental samples, clusters their influence vectors with K-Means, and assigns error types by comparing cluster centers with reference vectors generated from a clean calibration dataset with injected errors. (d) DeMix simplifies this pipeline by using a parameterized DEC $g_{\psi}$ to directly map each influence vector $\Phi_{i}$ to its error configuration $\hat{t}$, enabling simultaneous diagnosis and type-specific repair.
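A minimal sketch of two steps from the baseline workflows: the scalar thresholding of DDA-select, and the step in DDA-repair that matches cluster centers to reference vectors. It assumes mean aggregation and cosine similarity, neither of which is confirmed here as the baselines' exact choice. The K-Means step is omitted, and the toy influence vectors are hand-made to mimic the patterns described in the case study.

```python
import numpy as np

def dda_select(phi: np.ndarray, eta: float) -> np.ndarray:
    """Boolean keep-mask: aggregate each influence vector to a scalar, drop scores below eta."""
    scores = phi.mean(axis=1)  # assumption: aggregation by mean
    return scores >= eta

def assign_error_types(centers: np.ndarray, references: np.ndarray) -> np.ndarray:
    """Index of the most similar reference vector for each center (assumption: cosine similarity)."""
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    return (c @ r.T).argmax(axis=1)

phi = np.array([
    [-0.80, -0.60, -0.90, -0.70],  # consistently negative (LE-like pattern)
    [ 0.02,  0.05, -0.01,  0.00],  # close to zero (FE-like pattern)
    [ 0.00,  0.90,  0.00,  0.10],  # sharp positive spike (SC-like pattern)
    [ 0.30,  0.20,  0.40,  0.30],  # ordinary helpful sample
])

keep = dda_select(phi, eta=0.0)
types = assign_error_types(centers=phi[[2, 0]], references=phi[:3])
print(keep.tolist(), types.tolist())
```

In this toy case the scalar score flags only the first row, while the second and third rows pass the threshold even though their vectors differ clearly in shape. That loss of shape information is what a vector-based classifier is meant to avoid.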

## Appendix E Hyperparameters

We conduct a systematic grid search to optimize the two invariance regularization weights $\lambda_{1}$ and $\lambda_{2}$ in Eq. ([9](https://arxiv.org/html/2606.11616#S3.E9)). Table [6](https://arxiv.org/html/2606.11616#A5.T6) reports the debugging F1 (%) on the Adult dataset with $\alpha=0.5$. Setting $\lambda_{1}=\lambda_{2}=0.1$ yields the best performance. Higher regularization (e.g., 1) over-prioritizes invariance at the expense of prediction accuracy, while lower values (e.g., 0.01) diminish the information bottleneck benefits by retaining more configuration-specific noise.

Table 6. Grid search over $\lambda_{1}$ and $\lambda_{2}$ on Adult ($\alpha=0.5$). Reported metric: debugging F1-score (%).
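The search procedure itself can be sketched as an exhaustive sweep over a small grid. Here `grid_search` and `toy_evaluate` are hypothetical names, the grid values are only the examples mentioned in the text, and `toy_evaluate` is a synthetic stand-in for training the DEC and measuring debugging F1. It does not reproduce any measured result.

```python
import math
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_lambda1, best_lambda2, best_score) over the Cartesian grid."""
    best = None
    for lam1, lam2 in product(grid, repeat=2):
        score = evaluate(lam1, lam2)
        if best is None or score > best[2]:
            best = (lam1, lam2, score)
    return best

def toy_evaluate(lam1: float, lam2: float) -> float:
    # Synthetic score peaked at (0.1, 0.1) on a log scale; NOT a measured F1.
    return -((math.log10(lam1) + 1) ** 2 + (math.log10(lam2) + 1) ** 2)

best_lam1, best_lam2, best_score = grid_search(toy_evaluate, grid=[0.01, 0.1, 1.0])
print(best_lam1, best_lam2)
```

In practice `evaluate` would train the classifier with the given weights and return the validation metric, so each grid point costs one full training run.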