Training Dynamics of Neural Software Defect Predictors under Coupled Data-Quality Issues

arXiv cs.LG Papers

Summary

This paper investigates how training dynamics of neural networks for software defect prediction are affected by coupled data-quality issues such as class imbalance and overlap, proposing an interaction-aware empirical protocol.

arXiv:2606.24968v1 Announce Type: new Abstract: Context: Software defect prediction supports maintenance decisions such as testing prioritization, release-risk assessment, and quality monitoring. However, metric-based SDP datasets often contain coupled data-quality issues, especially class imbalance and class overlap. Prior work has mainly measured their impact through endpoint performance, while recent evidence suggests that such issues may also appear in neural training dynamics (gradients, weights, biases, error trajectories). However, these studies examine issues in isolation, leaving open how internal neural network training patterns manifest when data quality issues are coupled. Objective: We investigate how training-dynamics patterns from class imbalance, overlap, and their coupling can be characterized under interaction-aware conditions in deep learning-based SDP. Method: We conduct a controlled intervention study on class-level UBD datasets, training a fixed MLP under imbalance-only, overlap-only, and joint conditions across five seeds. Training dynamics are logged per epoch; fidelity is monitored via coupling ratios. Patterns are characterized using effect sizes, trajectories, sensitivity analyses, and rule-based classification. Expected contribution: The study will produce an interaction-aware empirical protocol and a candidate taxonomy of training-dynamics patterns for coupled data-quality issues in metric-based SDP.

# Training Dynamics of Neural Software Defect Predictors under Coupled Data-Quality Issues
Source: [https://arxiv.org/html/2606.24968](https://arxiv.org/html/2606.24968)
###### Abstract

Context—Software defect prediction supports maintenance decisions such as testing prioritization, release\-risk assessment, and quality monitoring\. However, metric\-based SDP datasets often contain coupled data\-quality issues, especially class imbalance and class overlap\. Prior work has mainly measured their impact through endpoint performance, while recent evidence suggests that such issues may also appear in neural training dynamics \(gradients, weights, biases, error trajectories\)\. However, these studies examine issues in isolation, leaving open how internal neural network training patterns manifest when data quality issues are coupled\.

Objective—We investigate how training\-dynamics patterns from class imbalance, overlap, and their coupling can be characterized under interaction\-aware conditions in deep learning\-based SDP\.

Method—We conduct a controlled intervention study on class\-level UBD datasets\[[9](https://arxiv.org/html/2606.24968#bib.bib115)\], training a fixed MLP under imbalance\-only, overlap\-only, and joint conditions across five seeds\. Training dynamics are logged per epoch; fidelity is monitored via coupling ratios\. Patterns are characterized using effect sizes, trajectories, sensitivity analyses, and rule\-based classification\.

Expected contribution—The study will produce an interaction\-aware empirical protocol and a candidate taxonomy of training\-dynamics patterns for coupled data\-quality issues in metric\-based SDP\.

## I Introduction

Software defect prediction \(SDP\) estimates which software artifacts are likely to contain defects\[[14](https://arxiv.org/html/2606.24968#bib.bib168)\] and supports testing prioritization, release\-risk assessment, inspection allocation, and defect management\. When maintenance teams use defect\-prediction outputs to allocate limited quality\-assurance effort, the question is not only whether a model achieves high predictive performance, but whether its behaviour can be trusted under the data conditions commonly found in software repositories\.

A persistent threat is training\-data quality\. Metric\-based SDP datasets often exhibit class imbalance \(defective instances are rarer\)\[[15](https://arxiv.org/html/2606.24968#bib.bib64),[8](https://arxiv.org/html/2606.24968#bib.bib169)\] and class overlap \(defective and non\-defective instances share similar feature\-space regions\)\[[11](https://arxiv.org/html/2606.24968#bib.bib62),[24](https://arxiv.org/html/2606.24968#bib.bib148)\]\. These issues can reduce performance, especially for the minority class, but their practical consequences extend beyond performance degradation\. If a defect\-prediction model behaves poorly, developers and researchers need to know whether the problem is likely caused by skewed class representation, ambiguous class boundaries, or their interaction\. Without this information, data cleaning, model debugging, and maintenance\-decision support remain largely trial\-and\-error\.

Most SDP studies evaluate data\-quality issues using endpoint metrics such as AUC, F1\-score, recall, precision, or balanced accuracy\[[13](https://arxiv.org/html/2606.24968#bib.bib167),[20](https://arxiv.org/html/2606.24968#bib.bib150),[15](https://arxiv.org/html/2606.24968#bib.bib64),[1](https://arxiv.org/html/2606.24968#bib.bib66),[7](https://arxiv.org/html/2606.24968#bib.bib55)\]\. These metrics show whether performance degrades, but not how the model learned or which data issue shaped learning\. For example, two datasets may produce similar F1\-score degradation even though one model struggled because defective instances were underrepresented, while another struggled because defective and non\-defective instances were difficult to separate\. In such cases, endpoint performance alone gives little guidance on whether practitioners should rebalance the data, reduce boundary ambiguity, revise labels, or treat the model as unreliable for the intended maintenance decision\.

Recent work suggests data\-quality problems may be reflected in training dynamics \(gradients, weights, biases, errors\)\[[3](https://arxiv.org/html/2606.24968#bib.bib165),[18](https://arxiv.org/html/2606.24968#bib.bib151)\], offering earlier diagnostic signals\. For SDP, this opens a direction where training behaviour may reveal sensitivity to particular data conditions\. However, this research direction is difficult because class imbalance and class overlap are not independent in practice\. Changing class proportions through sampling may also change neighbourhood structure and apparent boundary ambiguity\. Conversely, moving or removing boundary\-near instances to modify overlap may also affect class proportions\. Therefore, a training\-dynamics pattern observed under an intervention planned to manipulate imbalance or overlap may not reflect the intended issue alone\. It may also reflect unintended changes in the other issue or an interaction between the two\. This creates a methodological limitation in existing studies using training dynamics as evidence for data\-quality diagnosis in neural SDP models\.

This study addresses that limitation by proposing an interaction\-aware intervention study of class imbalance, class overlap, and their coupling in metric\-based SDP\. We will use class\-level SDP datasets from the Unified Bug Dataset\[[9](https://arxiv.org/html/2606.24968#bib.bib115)\], apply controlled imbalance\-only, overlap\-only, joint imbalance–overlap, and progressive\-reduction conditions, and train a fixed multilayer perceptron under repeated random seeds\. After every intervention, we will recompute realized class imbalance and overlap metric values to assess whether the intended data\-quality issue changed and whether the non\-target issue drifted\. We will then analyze predefined training\-dynamics metric families, including error dynamics, gradient magnitude and propagation, gradient distribution shape, and weight and bias summaries\.

The study aims to catalog candidate training\-dynamics patterns associated with imbalance, overlap, and their coupling, classify them as issue\-associated, shared, or joint\-condition patterns, and provide a reproducible SDP training\-dynamics protocol\. This is a necessary step toward practical data\-quality debugging for neural defect\-prediction pipelines\.

## II Related Work

### II\-A Software Defect Prediction and Data Quality

Public SDP datasets such as PROMISE and UBD have enabled comparative defect\-prediction research\[[2](https://arxiv.org/html/2606.24968#bib.bib152),[9](https://arxiv.org/html/2606.24968#bib.bib115)\]\. Prior work shows that model conclusions are sensitive to dataset properties, learner choice, evaluation metrics, and preprocessing\[[12](https://arxiv.org/html/2606.24968#bib.bib58),[19](https://arxiv.org/html/2606.24968#bib.bib154),[20](https://arxiv.org/html/2606.24968#bib.bib150)\], making data\-quality issues such as imbalance and overlap important for both performance and interpretation\.

### II\-B Class Imbalance, Class Overlap, and Their Coupling

Class imbalance is widely studied in SDP because defective instances are often rare, motivating SMOTE and undersampling\[[4](https://arxiv.org/html/2606.24968#bib.bib161),[23](https://arxiv.org/html/2606.24968#bib.bib162)\]\. Class overlap has also gained attention because defective and non\-defective modules may share similar metric values\[[11](https://arxiv.org/html/2606.24968#bib.bib62),[6](https://arxiv.org/html/2606.24968#bib.bib134)\]\. More recent work suggests that imbalance and overlap should not always be treated as independent data\-quality issues, because rebalancing can alter neighbourhood structure, overlap handling can affect class proportions, and combined treatments may change both performance and interpretation\[[22](https://arxiv.org/html/2606.24968#bib.bib149),[24](https://arxiv.org/html/2606.24968#bib.bib148)\]\. These studies motivate our focus on coupling\. However, most existing work still evaluates imbalance and overlap primarily through endpoint performance measures\[[17](https://arxiv.org/html/2606.24968#bib.bib121),[20](https://arxiv.org/html/2606.24968#bib.bib150),[6](https://arxiv.org/html/2606.24968#bib.bib134)\] or feature\-importance stability\. In contrast, our study will examine whether training\-dynamics patterns vary when imbalance and overlap are manipulated separately, jointly, and under monitored non\-target drift\.

### II\-C From Endpoint Metrics to Training Dynamics

Endpoint metrics are necessary for evaluating defect predictors\[[20](https://arxiv.org/html/2606.24968#bib.bib150),[1](https://arxiv.org/html/2606.24968#bib.bib66)\], but they do not show how neural models learn under problematic data conditions\. Recent data\-bug work in software engineering uses training logs, gradients, weights, biases, and trajectories to study data\-quality effects in deep learning\[[18](https://arxiv.org/html/2606.24968#bib.bib151)\]\. This line of work provides an important motivation for treating training dynamics as possible evidence of data\-condition sensitivity\. Nevertheless, we are not aware of prior SDP work that combines single\-issue, joint, and drift\-monitored intervention protocols for candidate training\-dynamics classification\. Our study addresses this gap by measuring the realized severity of imbalance and overlap after each intervention, monitoring non\-target drift, and applying progressive reduction and rule\-based classification to interpret candidate patterns\.

## III Research Questions

Our investigation is guided by three open research questions:

RQ1\. How do candidate training\-dynamics patterns change when class imbalance and class overlap are manipulated separately under controlled, fidelity\-monitored interventions?

*Rationale\.* This RQ will characterize severity\-response patterns under imbalance\-only and overlap\-only intervention paths\. We will not assume that either issue is perfectly isolated; instead, we will measure the realized severity of both imbalance and overlap after every intervention and interpret patterns in light of any non\-target drift\.

RQ2\. How do candidate training\-dynamics patterns vary when imbalance and overlap are introduced jointly?

*Rationale\.* This RQ examines whether single\-issue patterns appear together, weaken, dominate, or change under joint intervention\. Although C2 and N1 are both bounded in \[0,1\], they measure different constructs\. Therefore, we will use low, medium, and high levels to structure the joint intervention, but we will interpret the results using the realized post\-intervention imbalance and overlap measurements rather than assuming direct equivalence between the two metrics\.

RQ3\. How do candidate patterns change under progressive issue reduction?

*Rationale\.* Single\-direction intervention alone may produce patterns that are sensitive to the intervention mechanism\. This RQ uses progressive reduction as triangulation: patterns are more credible if they change under injection and weaken under reduction\. We do not interpret this as causal reversal, but as evidence about pattern robustness\.

## IV Datasets and Preprocessing

### IV\-A Dataset Source and Unit of Analysis

We will use the Unified Bug Dataset \(UBD\), a public curated benchmark for Java bug prediction that consolidates metric\-based datasets from five public sources and recomputes a common set of source\-code metrics\[[9](https://arxiv.org/html/2606.24968#bib.bib115)\]\. UBD provides both class\-level and file\-level versions\. To avoid mixing granularities and to reduce duplicate representations of the same systems, the primary analysis will use the *class\-level* UBD datasets only\. Each dataset will be treated as an independent binary SDP task in which the unit of prediction is a Java class and the target indicates whether the class is defective\. File\-level UBD datasets will not be mixed into the primary analysis; if used at all, they will be reported separately as a robustness extension rather than as part of the main dataset pool\.

We will start from the full set of candidate class\-level UBD CSV files available in our dataset pool\. If duplicate representations of the same project, version, and granularity are present, we will retain one canonical UBD representation and remove exact duplicates\. Different releases or versions of the same project will be retained as separate datasets because they represent different prediction contexts\.

### IV\-B Inclusion and Exclusion Criteria

The inclusion and exclusion criteria will be applied before model training and before inspecting any training\-dynamics outcomes\. A dataset will be included in the primary analysis only if it is a class\-level UBD CSV dataset with a binary target variable, numeric or Boolean predictors, at least 200 instances before splitting, and at least 50 original instances in each class\. Identifier columns and non\-feature metadata will be removed\. Labels must be encodable as 0 and 1, and stratified train/validation/test splitting must preserve both classes in all partitions\.

Each intervention condition must retain at least 20 training instances per class after injection or progressive reduction and must contain only finite feature values\. Entirely missing, constant, or non\-finite feature columns will be removed; if this makes a dataset or condition invalid, the affected dataset or condition will be excluded and logged with an explicit reason\. If a dataset is valid for some intervention families but invalid for others, it will be retained only for the RQ\-specific analyses for which the required intervention conditions are valid\. Based on the current UBD candidate pool and script assumptions, we expect approximately 40–50 datasets to remain eligible, but the final number will be determined only after applying these registered filtering rules\.

The 200\-instance and 50\-per\-class pre\-intervention floors make the 20\-per\-class post\-intervention floor achievable after severe undersampling or editing; the latter is the minimum we consider viable for mini\-batch training and nearest\-neighbour overlap measurement\. A sensitivity analysis will vary this post\-intervention floor to 15 and 25\.
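The size floors above can be read as two simple predicates. The following is a minimal sketch, assuming labels are already encoded as 0 and 1; the function and constant names are hypothetical and are not taken from the study's scripts.

```python
from collections import Counter

MIN_INSTANCES = 200        # dataset size floor before splitting
MIN_PER_CLASS = 50         # per-class floor before splitting
MIN_TRAIN_PER_CLASS = 20   # per-class training floor after an intervention


def dataset_is_eligible(labels):
    """Pre-split check: binary target, >= 200 instances, >= 50 per class."""
    counts = Counter(labels)
    return (
        len(labels) >= MIN_INSTANCES
        and len(counts) == 2
        and min(counts.values()) >= MIN_PER_CLASS
    )


def condition_is_valid(train_labels, floor=MIN_TRAIN_PER_CLASS):
    """Post-intervention check: both classes present with >= floor instances."""
    counts = Counter(train_labels)
    return len(counts) == 2 and min(counts.values()) >= floor
```

Passing `floor=15` or `floor=25` to `condition_is_valid` corresponds to the sensitivity variants of the post-intervention floor.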

### IV\-C Splitting and Preprocessing

Each dataset will be processed independently using a fixed stratified split with seed 42: 20% test, then 20% of the remaining data for validation, yielding approximately 64/16/20 train/validation/test partitions\. The same validation and test sets will be reused across interventions and seeds\. Median imputation and `StandardScaler` will be fitted on the original training partition only and applied to all partitions; Boolean features will be encoded as 0/1 before scaling\. Interventions will modify only transformed training data, never validation or test data\. This design ensures that differences across intervention conditions reflect changes in the training data rather than changes in the evaluation data or data\-leakage from validation/test partitions\.
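The 64/16/20 figure follows from holding out 20% for testing and then 20% of the remaining 80% for validation (0.8 × 0.2 = 0.16, leaving 0.64 for training). The sketch below illustrates this two-stage stratified split with the standard library only. It is an assumption-laden illustration: the helper names are hypothetical, and a real pipeline would more likely use scikit-learn's `train_test_split` with `stratify` and `random_state=42`, which will produce different index assignments than this code.

```python
import random
from collections import defaultdict


def stratified_split(labels, frac, rng):
    """Hold out roughly `frac` of each class; return (rest_idx, held_idx)."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rest, held = [], []
    for y in sorted(by_class):
        members = by_class[y][:]
        rng.shuffle(members)
        n_held = round(frac * len(members))
        held.extend(members[:n_held])
        rest.extend(members[n_held:])
    return sorted(rest), sorted(held)


def three_way_split(labels, seed=42):
    """20% test, then 20% of the remainder for validation."""
    rng = random.Random(seed)
    trainval, test = stratified_split(labels, 0.20, rng)
    sub_labels = [labels[i] for i in trainval]
    train_pos, val_pos = stratified_split(sub_labels, 0.20, rng)
    # Map positions within the train+validation subset back to dataset indices.
    train = [trainval[p] for p in train_pos]
    val = [trainval[p] for p in val_pos]
    return train, val, test
```

For 800 non-defective and 200 defective instances this yields 640/160/200 indices, with the 4:1 class ratio preserved in every partition.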

## V Execution Plan

### V\-A Data\-Quality Measurement

For each dataset and intervention condition, we will compute two data\-quality measures on the preprocessed training data only\. Class imbalance will be measured using C2, the imbalance\-ratio complexity measure\[[16](https://arxiv.org/html/2606.24968#bib.bib14)\]\. For $n_c$ classes, $n_i$ instances in class $i$, and $n$ total instances, we compute:

$$IR=\frac{n_c-1}{n_c}\sum_{i=1}^{n_c}\frac{n_i}{n-n_i},\qquad C2=1-\frac{1}{IR}.$$

Class overlap will be measured using N1, the fraction of borderline points\[[16](https://arxiv.org/html/2606.24968#bib.bib14)\]\. We will compute N1 using Euclidean distances over standardized training features\. Let $G=(V,E)$ be the minimum spanning tree over the training instances, where each vertex $i\in V$ represents one training instance $x_i$ with class label $y_i$, and each edge $(i,j)\in E$ connects two instances in the tree\. N1 is the fraction of instances that are incident to at least one MST edge connecting opposite\-class instances:

$$N1=\frac{\left|\left\{i\in V\mid\exists j\in V:(i,j)\in E\land y_i\neq y_j\right\}\right|}{n}.$$

Here, $n=|V|$ is the number of training instances\. C2 will be the target metric for imbalance interventions, and N1 will be the target metric for overlap interventions\. Both metrics will be recomputed after every intervention condition\. The non\-target metric will be used to monitor drift; for example, N1 drift during imbalance intervention and C2 drift during overlap intervention\. Although both metrics are bounded between 0 and 1, they measure different constructs, so equal numeric values will not be treated as equivalent severity\.
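To make the two measures concrete, here is a minimal standard-library sketch of C2 and N1 as defined above. It assumes at least two classes and features given as numeric tuples; the function names are hypothetical, and the quadratic-time Prim construction is chosen for readability, not as a description of the study's implementation.

```python
import math
from collections import Counter


def c2_imbalance(labels):
    """C2 = 1 - 1/IR, with IR the imbalance ratio defined above."""
    counts = Counter(labels)
    n_c, n = len(counts), len(labels)
    ir = (n_c - 1) / n_c * sum(n_i / (n - n_i) for n_i in counts.values())
    return 1.0 - 1.0 / ir


def n1_borderline(points, labels):
    """Fraction of instances on at least one mixed-class MST edge (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n
    parent = [-1] * n
    best_dist[0] = 0.0
    borderline = set()
    for _ in range(n):
        # Pick the closest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        # The edge (parent[u], u) joins the MST; record it if it crosses classes.
        if parent[u] >= 0 and labels[u] != labels[parent[u]]:
            borderline.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], parent[v] = d, u
    return len(borderline) / n
```

For a binary task with a 5:1 class ratio, IR = 0.5 × (5 + 0.2) = 2.6 and C2 ≈ 0.615, while a perfectly balanced task gives C2 = 0.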

### V\-B Model Architecture and Training Protocol

We will use a fixed feed\-forward multilayer perceptron \(MLP\) for all datasets and intervention conditions\. The input size will match the number of preprocessed predictors; hidden layers will have 128, 64, and 32 units with ReLU activations and dropout \($p=0.2$\); the output layer will produce two logits\. Weights will be initialized with Kaiming normal initialization and biases with zeros\. Models will be trained in PyTorch using cross\-entropy loss, Adam \($lr=10^{-3}$, weight decay $10^{-4}$\), shuffled mini\-batches of size 64, and 100 fixed epochs\. We will not use class weights, class\-balanced batch sampling, or early stopping in the primary analysis, because these mechanisms could partially counteract the intended imbalance interventions or make trajectories incomparable across conditions\. Dropout will be active during training and disabled during validation/test evaluation\. We keep the architecture fixed to avoid confounding data\-quality intervention effects with model\-selection decisions\.

Each intervention\-generated training set will be trained with seeds $\{42,43,44,45,46\}$\. Training and validation metrics will be logged each epoch; test metrics will be computed once after training\. Test metrics are reported only as endpoint context and are not used for candidate pattern classification\. Gradient summaries will be recorded after backpropagation and before the optimizer update, then aggregated to epoch\-level summaries\. We will fix Python, NumPy, and PyTorch seeds, enable deterministic PyTorch settings where feasible, and report package versions, CUDA/device information, and hardware in the artifact package\.

As a robustness check, we will train a secondary \[256\-128\-64\] MLP under identical optimization, regularization, and hyperparameter settings\. Because this variant keeps three hidden layers and only doubles their widths, it isolates hidden\-layer width \(model capacity\) as the variable of interest\. Future work may examine interactions between training configuration and data\-quality symptoms\.

### V\-C Fault\-Injection Protocols

#### V\-C1 Class Imbalance Fault\-Injection Procedure

The imbalance\-injection protocol will construct a controlled low\-imbalance reference and then progressively reduce the minority class\. The reference will be created by random oversampling with replacement on the preprocessed training data only, using random state 123, until the minority class matches the majority\-class count\. This reference will not be treated as natural or clean; it is only a controlled starting point for manipulating class ratios\.

Starting from this reference, we will keep the majority class fixed and undersample without replacement from the oversampled minority pool\. For each severity level

$$\alpha\in\{0.00,0.25,0.50,0.75,1.00\},$$

the target majority/minority ratio will be

$$r(\alpha)=1+\alpha(5-1),$$

where $\alpha=0$ corresponds to the low\-imbalance reference and $\alpha=1$ corresponds to an intended 5:1 majority/minority ratio\. The 5:1 upper bound is chosen to create a clear imbalance gradient while avoiding extreme minority depletion that would make stratified evaluation, nearest\-neighbour measures, and neural training unstable\.

For each severity level, the target minority count will be

$$n_{\min}(\alpha)=\max\left(\left\lceil\frac{n_{\mathrm{maj}}}{r(\alpha)}\right\rceil,20\right),$$

where $n_{\mathrm{maj}}$ is the fixed majority\-class count\. The floor of 20 training instances per class is used to avoid intervention conditions that are too small for stratified evaluation, neighbourhood\-based overlap measurement, and neural\-network training\. Conditions that cannot satisfy this floor will be marked invalid for the corresponding analysis\. After every imbalance condition, we will recompute realized imbalance \(C2\) and overlap \(N1\) measurements to assess intervention fidelity and non\-target drift\.
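As a worked illustration of the schedule above, the sketch below evaluates the target ratio and target minority count for each severity level. The function names are hypothetical; the formulas are the ones stated in the text.

```python
import math

SEVERITIES = (0.00, 0.25, 0.50, 0.75, 1.00)


def target_ratio(alpha, max_ratio=5.0):
    """Majority/minority ratio r(alpha) = 1 + alpha * (max_ratio - 1)."""
    return 1.0 + alpha * (max_ratio - 1.0)


def target_minority_count(n_maj, alpha, floor=20):
    """n_min(alpha) = max(ceil(n_maj / r(alpha)), floor)."""
    return max(math.ceil(n_maj / target_ratio(alpha)), floor)


# Example: a fixed majority class of 500 training instances.
schedule = [target_minority_count(500, a) for a in SEVERITIES]
```

With 500 majority instances the minority targets are 500, 250, 167, 125, and 100. With only 60 majority instances the unconstrained 5:1 target would be 12, so the floor of 20 takes over at the highest severity.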

#### V\-C2 Class Overlap Fault\-Injection Procedure

The overlap\-injection protocol will be treated as a controlled boundary\-ambiguity stressor, not as a claim of naturally occurring overlap\. We will first construct a lower\-overlap reference using Repeated Edited Nearest Neighbours \(RENN\) on standardized training features\[[21](https://arxiv.org/html/2606.24968#bib.bib163)\]\. The method parameters will be fixed before execution and will not be tuned per dataset\. If RENN removes too many instances, leaves a single\-class training set, or violates the minimum class floor, the dataset will be marked invalid for overlap\-injection analysis\.

Starting from the lower\-overlap reference, we will identify boundary\-near candidates separately within each class\. For each instance $x_i$, we will find its nearest opposite\-class neighbour $x_{opp(i)}$ using Euclidean distance over standardized training features\. Within each class, the closest 25% of instances to an opposite\-class neighbour will form the candidate pool\. This class\-wise rule prevents the candidate pool from being dominated by the majority class while focusing the perturbation on boundary\-near regions\.

For each severity level

$$\alpha\in\{0.00,0.25,0.50,0.75,1.00\},$$

we will modify a nested fraction $\alpha$ of the candidate pool\. A selected instance will be moved toward its nearest opposite\-class neighbour while retaining its original label:

$$x_i^{\prime}=(1-\gamma)x_i+\gamma x_{opp(i)},\quad\gamma=0.80.$$

The value $\gamma=0.80$ is used as a strong but bounded perturbation: it moves selected samples close to the opposite\-class region without replacing them by the opposite\-class instance\. Because labels are preserved, the procedure may introduce label–feature tension; we therefore interpret it as a boundary\-ambiguity stressor and discuss this as a construct\-validity threat\.
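The perturbation can be sketched as follows. This is an illustrative reading of the procedure, not the study's code: the helper names are hypothetical, the rounding rules (ceiling for the 25% pool, floor for the fraction of the pool that is moved) are assumptions, and the nested fraction is taken within each class's pool, nearest-to-boundary first, so that lower severities select a subset of what higher severities select.

```python
import math


def nearest_opposite(i, points, labels):
    """Index of the nearest opposite-class neighbour of instance i."""
    others = (j for j in range(len(points)) if labels[j] != labels[i])
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def candidate_pool(points, labels, pool_frac=0.25):
    """Per class, the closest pool_frac of instances to the other class."""
    pool = {}
    for cls in sorted(set(labels)):
        members = [i for i in range(len(points)) if labels[i] == cls]
        members.sort(
            key=lambda i: math.dist(points[i], points[nearest_opposite(i, points, labels)])
        )
        pool[cls] = members[: math.ceil(pool_frac * len(members))]
    return pool


def inject_overlap(points, labels, alpha, gamma=0.80):
    """Move a nested fraction alpha of each class's pool toward the other class."""
    moved = [list(p) for p in points]
    for members in candidate_pool(points, labels).values():
        for i in members[: math.floor(alpha * len(members))]:
            j = nearest_opposite(i, points, labels)
            # x_i' = (1 - gamma) * x_i + gamma * x_opp(i); the label is unchanged.
            moved[i] = [(1 - gamma) * a + gamma * b for a, b in zip(points[i], points[j])]
    return moved
```

Neighbours are always looked up in the unperturbed reference set, so the result does not depend on the order in which instances are moved.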

After every overlap condition, we will recompute realized overlap \(N1\) and imbalance \(C2\) measurements\. A dataset will be included in the primary overlap\-injection analysis only if realized overlap increases with severity according to the intervention\-fidelity check; otherwise, it will be flagged for sensitivity analysis or excluded from the overlap\-specific analysis\. Conditions that produce non\-finite values or violate the minimum class floor will be marked invalid\. We will also run a sensitivity analysis with $\gamma=0.60$ to assess whether candidate patterns depend on the chosen perturbation strength\.

#### V\-C3 Joint Imbalance–Overlap Intervention

The joint intervention addresses RQ2 by examining candidate training\-dynamics patterns when imbalance and overlap are introduced together\. We will use a factorial intervention design\[[10](https://arxiv.org/html/2606.24968#bib.bib170)\], crossing three nominal imbalance levels with three nominal overlap levels:

$$\alpha_I,\alpha_O\in\{0.00,0.50,1.00\}.$$

This yields nine joint conditions per dataset\. The levels are nominal design levels only; they do not imply equal substantive severity across imbalance and overlap\. Therefore, after each joint condition, we will recompute the realized imbalance and overlap measurements and use these realized values, rather than nominal $\alpha$ labels alone, in the analysis\.

The primary intervention order will be imbalance first, followed by overlap, because the overlap perturbation preserves labels and is therefore less likely to directly change the class\-count ratio after imbalance has been established\. For each joint condition, we will record nominal levels, realized data\-quality measurements, class counts, modified samples, and non\-target drift\. Conditions violating the minimum class floor, producing non\-finite values, or failing fidelity checks will be marked invalid\. As an order\-sensitivity check, we will repeat the joint intervention in reverse order for up to ten retained datasets selected to cover low, medium, and high original imbalance/overlap profiles where available\.
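The factorial design can be enumerated directly. The sketch below is only illustrative; the dictionary keys and the `order` field are hypothetical names for the bookkeeping described above.

```python
from itertools import product

LEVELS = (0.00, 0.50, 1.00)
SEEDS = (42, 43, 44, 45, 46)


def joint_conditions(order=("imbalance", "overlap")):
    """Cross three nominal imbalance levels with three nominal overlap levels."""
    return [
        {"alpha_I": a_i, "alpha_O": a_o, "order": order}
        for a_i, a_o in product(LEVELS, LEVELS)
    ]


conditions = joint_conditions()
runs_per_dataset = len(conditions) * len(SEEDS)
```

Nine joint conditions trained under five seeds give 45 joint-intervention training runs per dataset in the primary order; calling `joint_conditions(order=("overlap", "imbalance"))` enumerates the reverse-order sensitivity check.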

### V\-D Progressive Reduction Interventions

Progressive reduction addresses RQ3 by providing convergent exploratory evidence about whether candidate patterns change when the targeted data\-quality issue is reduced\. We will not describe these procedures as cleaning, reversal, or causal validation, because both procedures introduce their own artifacts\. Instead, they are treated as additional interventions that reduce the measured severity of imbalance or overlap under a fixed protocol\.

For imbalance reduction, we will start from the original preprocessed training partition and progressively increase the minority class toward the majority\-class count using SMOTE\[[4](https://arxiv.org/html/2606.24968#bib.bib161)\]\. For each reduction level

$$\alpha\in\{0.00,0.25,0.50,0.75,1.00\},$$

the target minority count will be

$$n_{\mathrm{min}}^{\prime}(\alpha)=n_{\mathrm{min}}+\left\lfloor\alpha(n_{\mathrm{maj}}-n_{\mathrm{min}})\right\rfloor,$$

where $n_{\mathrm{min}}$ and $n_{\mathrm{maj}}$ are the original minority\- and majority\-class counts in the training partition\. SMOTE will use $k=5$ neighbours and random state 123; if the minority class is too small, $k$ will be reduced to $n_{\mathrm{min}}-1$\. The majority class will remain unchanged\. Each condition must retain at least 20 training instances per class\.
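The reduction schedule is simple arithmetic, sketched below with hypothetical function names. The synthetic instances themselves would come from a SMOTE implementation, which is not reproduced here.

```python
import math

REDUCTION_LEVELS = (0.00, 0.25, 0.50, 0.75, 1.00)


def smote_target_minority(n_min, n_maj, alpha):
    """n_min'(alpha) = n_min + floor(alpha * (n_maj - n_min))."""
    return n_min + math.floor(alpha * (n_maj - n_min))


def smote_neighbours(n_min, k_default=5):
    """Use k = 5 unless the minority class is too small, then n_min - 1."""
    return min(k_default, n_min - 1)


# Example: 40 minority and 200 majority training instances.
targets = [smote_target_minority(40, 200, a) for a in REDUCTION_LEVELS]
```

For 40 minority and 200 majority instances the minority targets are 40, 80, 120, 160, and 200, so the final level reaches parity with the unchanged majority class.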

For overlap reduction, we will use a RENN\-based progressive editing procedure on standardized training features\[[21](https://arxiv.org/html/2606.24968#bib.bib163)\]\. RENN will first identify candidate majority\-class instances located in locally ambiguous neighbourhoods\. These candidates will be ranked by boundary ambiguity, operationalized by their distance to the nearest opposite\-class neighbour and local class disagreement\. For each $\alpha$, we will remove a nested fraction $\alpha$ of these candidates, while preserving the minimum floor of 20 training instances per class\. Conditions that remove too many samples, produce a single\-class training set, or generate non\-finite values will be marked invalid\.

After every progressive reduction condition, we will recompute realized imbalance \(C2\) and overlap \(N1\) measurements\. Progressive reduction is used as a triangulation step, not as causal reversal\. A candidate pattern is considered more credible when its direction of change under the corresponding reduction intervention is opposite to its direction under injection, provided non\-target drift remains below the pre\-specified threshold\.

### V\-E Intervention\-Fidelity and Non\-Target Drift Checks

After each intervention condition, we will recompute the realized imbalance and overlap metrics on the modified training data\. A single\-issue intervention path will be treated as fidelity\-consistent only if the target metric changes in the expected direction with Spearman’s $|\rho|\geq 0.70$ and changes by at least 0\.10 between its lowest and highest realized values across severity levels\. Non\-target drift will be flagged when the non\-target metric changes by at least 0\.10 relative to the reference condition\. We will also compute a coupling ratio,

$$CR=\frac{|\Delta N|}{|\Delta T|+10^{-6}},$$

where $\Delta N$ and $\Delta T$ are the non\-target and target metric changes, respectively\. Conditions with $CR\geq 0.50$ will be flagged as strongly coupled\. Candidate pattern categories will be checked with and without high\-drift or strongly coupled conditions; categories that change will be reported as drift\-sensitive\. The coupling\-ratio threshold of 0\.5 flags conditions where non\-target drift exceeds half the intended change, representing a point at which the non\-target metric shift is large enough to contaminate symptom attribution materially\. Sensitivity will be checked using $\pm0.10$\.
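A minimal sketch of these checks is given below, using only the standard library. The function names are hypothetical; the Spearman coefficient is computed as a Pearson correlation on average ranks, and returning 0 for a constant series is an assumption made here to avoid division by zero.

```python
def average_ranks(values):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation as Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series carries no monotone signal
    return cov / (var_x * var_y) ** 0.5


def fidelity_consistent(severity, target, expected_sign=1, rho_min=0.70, range_min=0.10):
    """Target metric must move in the expected direction and span >= range_min."""
    rho = spearman_rho(severity, target)
    return rho * expected_sign >= rho_min and max(target) - min(target) >= range_min


def coupling_ratio(delta_nontarget, delta_target, eps=1e-6):
    """CR = |delta N| / (|delta T| + 1e-6)."""
    return abs(delta_nontarget) / (abs(delta_target) + eps)
```

For example, a target change of 0.40 accompanied by a non-target change of 0.25 gives CR ≈ 0.625 and would be flagged as strongly coupled, whereas a non-target change of 0.06 gives CR ≈ 0.15.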

### V\-F Training\-Dynamics Logging

Following Shah et al\.’s study on data bugs in deep learning models for software engineering\[[18](https://arxiv.org/html/2606.24968#bib.bib151)\], we will log training\-time behaviour through error trajectories and model\-internal statistics\. Shah et al\. use training logs, gradients, weights, and biases to study symptoms of data\-quality and preprocessing issues in SE deep learning tasks\[[18](https://arxiv.org/html/2606.24968#bib.bib151)\]\. We adapt this idea to metric\-based SDP by tracking four predefined metric families: \(i\) error dynamics, including training error, validation error, and the train–validation gap; \(ii\) gradient magnitude and propagation, including first\-, middle\-, and last\-layer gradient RMS and the first\-to\-last gradient RMS ratio; \(iii\) gradient distribution shape, including skewness, kurtosis, and near\-zero gradient proportion for selected layers; and \(iv\) parameter statistics, including selected weight and bias summaries\.

Training and validation metrics will be logged at every epoch\. Gradient summaries will be computed after backpropagation and before the optimizer update, then aggregated to epoch\-level summaries\. Run\-level analyses will summarize epoch\-level metrics using a fixed late\-training window, operationalized as the final 20% of epochs\. This pragmatic choice reduces sensitivity to single\-epoch noise while keeping the summary focused on late\-training behaviour\.
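As a rough sketch of the run-level summary, the snippet below computes a gradient RMS, a first-to-last RMS ratio, and the mean over the final 20% of epochs. It works on plain Python lists; in an actual training loop the gradient values would be read from the framework after backpropagation and before the optimizer step, which is not shown here. The function names, the rounding rule for the window size, the small `eps` in the ratio, and the use of the mean as the window summary are assumptions.

```python
import math

def gradient_rms(grad_values):
    """Root mean square of a flat list of gradient entries for one layer."""
    return math.sqrt(sum(g * g for g in grad_values) / len(grad_values))

def first_to_last_ratio(first_layer_rms, last_layer_rms, eps=1e-12):
    """Ratio of first-layer to last-layer gradient RMS (eps avoids division by zero)."""
    return first_layer_rms / (last_layer_rms + eps)

def late_window_mean(epoch_values, fraction=0.20):
    """Mean of a per-epoch metric over the final `fraction` of epochs."""
    n = len(epoch_values)
    window = max(1, round(n * fraction))  # at least one epoch is always used
    tail = epoch_values[-window:]
    return sum(tail) / len(tail)

# Hypothetical per-epoch validation errors for a 10-epoch run.
val_error = [0.50, 0.42, 0.37, 0.33, 0.30, 0.29, 0.28, 0.28, 0.27, 0.27]
print(late_window_mean(val_error))  # mean over the last 2 epochs
```

Averaging over a window rather than taking the last epoch is what gives the summary its reduced sensitivity to single-epoch noise.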

### V\-G Pattern Classification and Statistical Analysis

We will treat all pattern categories as exploratory candidates, not causal diagnoses\. Because multiple metrics are logged, isolated metric changes will not be interpreted as standalone evidence\. For each dataset, intervention condition, seed, and logged metric, we will summarize epoch\-level values using the fixed late\-training window summaries described above\. For each metric, we will compute the signed standardized change between the low\-severity reference condition and higher\-severity conditions\. We will treat $|d| \geq 0.5$ as a non\-trivial standardized change, following the conventional medium\-effect threshold for Cohen\-style standardized effects\[[5](https://arxiv.org/html/2606.24968#bib.bib164)\]\. This floor is preferred over $|d| \geq 0.2$ to avoid classifying substantively negligible changes as candidate patterns across the large number of metrics examined\. Smaller effects will be reported descriptively but will not be used to assign exploratory pattern categories\.
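One way to compute a signed standardized change is sketched below. The text does not state which standardizer is used, so the pooled standard deviation across seeds (the classic Cohen's d form) is an assumption here, as are the function names and the example values.

```python
import math
from statistics import mean, variance

def standardized_change(reference, treated):
    """Signed Cohen-style d: (mean(treated) - mean(reference)) / pooled SD.

    `reference` holds per-seed window summaries under the low-severity
    condition; `treated` holds them under a higher-severity condition.
    """
    n_r, n_t = len(reference), len(treated)
    pooled_var = ((n_r - 1) * variance(reference)
                  + (n_t - 1) * variance(treated)) / (n_r + n_t - 2)
    diff = mean(treated) - mean(reference)
    if pooled_var == 0.0:
        # No spread across seeds: report 0 for no change, else a signed infinity.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled_var)

def is_nontrivial(d, floor=0.5):
    """Apply the |d| >= 0.5 floor used for assigning candidate categories."""
    return abs(d) >= floor

# Hypothetical late-window validation errors over five seeds.
reference = [0.10, 0.12, 0.11, 0.09, 0.13]
treated = [0.15, 0.17, 0.16, 0.14, 0.18]
d = standardized_change(reference, treated)
print(round(d, 3), is_nontrivial(d))
```

With only five seeds per condition, the estimate of d is itself noisy, which is one reason the protocol pairs the effect-size floor with separate consistency and stability rules.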

A candidate pattern will be considered direction\-consistent if at least 70% of eligible datasets show the same direction of change, while fewer than 20% show the reverse direction\. Datasets with negligible changes will be treated as neutral rather than opposite\. The 70% super\-majority and 20% reverse ceiling together ensure classification is not driven by a narrow dataset subset nor assigned when a non\-trivial minority shows the opposite pattern\.
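The direction-consistency rule can be expressed as a small vote count over datasets. In this sketch, each dataset contributes one standardized change; the choice of $|d| < 0.5$ as the cut-off for "negligible" (neutral) datasets is an assumption, since the text does not fix that value, and the function name is hypothetical.

```python
def direction_consistent(dataset_effects, neutral_below=0.5,
                         majority=0.70, reverse_ceiling=0.20):
    """Return (is_consistent, sign) for one metric across eligible datasets.

    A dataset votes +1 or -1 only if |d| >= neutral_below; otherwise it is
    neutral. The pattern is consistent if at least `majority` of datasets
    share one direction and fewer than `reverse_ceiling` show the reverse.
    """
    n = len(dataset_effects)
    if n == 0:
        return False, 0
    positive = sum(1 for d in dataset_effects if d >= neutral_below)
    negative = sum(1 for d in dataset_effects if d <= -neutral_below)
    if positive >= negative:
        dominant, reverse, sign = positive, negative, +1
    else:
        dominant, reverse, sign = negative, positive, -1
    ok = dominant / n >= majority and reverse / n < reverse_ceiling
    return ok, (sign if ok else 0)

# Hypothetical effects for one metric on ten datasets: 8 positive, 2 neutral.
effects = [0.8, 0.9, 0.6, 0.7, 1.1, 0.1, 0.55, 0.65, -0.1, 0.75]
print(direction_consistent(effects))
```

Because neutral datasets still count in the denominator, a pattern that is strong on a few datasets but negligible on most will not pass the 70% bar.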

Within a dataset, the pattern must also be seed\-stable: at least four of the five training seeds must show the same sign of change\. This near\-unanimous criterion ensures the pattern is stable under within\-condition stochastic variation in initialization and mini\-batch ordering, rather than reflecting a single unlucky seed\. Conditions with strong non\-target drift will be flagged before classification\. We define strong non\-target drift as an absolute change of at least 0\.10 in the non\-target data\-quality metric\. Since C2 and N1 are bounded in $[0,1]$, this represents a 10\-percentage\-point change\. The 0\.10 drift threshold is used only as a within\-metric operational flag for substantial change, not as a claim that C2 and N1 have equivalent substantive severity\. We will therefore interpret drift using both the absolute within\-metric change and the coupling ratio, and will report sensitivity analyses using 0\.05 and 0\.15 thresholds\. Candidate pattern categories must remain qualitatively unchanged when high\-drift conditions are excluded; otherwise, the pattern will be categorized as drift\-sensitive\.
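The seed-stability rule and the drift flag, including the 0.05 and 0.15 sensitivity thresholds, might look as follows. Function names and example values are hypothetical; a zero change is treated as agreeing with neither sign, which is an assumption.

```python
def seed_stable(seed_changes, min_agree=4):
    """True if at least `min_agree` seeds share the same sign of change."""
    positive = sum(1 for c in seed_changes if c > 0)
    negative = sum(1 for c in seed_changes if c < 0)
    return max(positive, negative) >= min_agree

def drift_flags(nontarget_reference, nontarget_condition,
                thresholds=(0.05, 0.10, 0.15)):
    """Flag strong non-target drift at the main and sensitivity thresholds."""
    drift = abs(nontarget_condition - nontarget_reference)
    return {t: drift >= t for t in thresholds}

# Hypothetical per-seed changes in one metric, and a non-target metric
# (for example N1 under an imbalance-only intervention) moving 0.25 -> 0.37.
print(seed_stable([0.4, 0.6, 0.3, -0.1, 0.5]))
print(drift_flags(0.25, 0.37))
```

Reporting the flag at all three thresholds makes it easy to see which conditions sit close to the 0.10 boundary.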

We will use the following exploratory pattern categories\. A pattern will be categorized as *candidate imbalance\-associated* if it satisfies the effect\-size, direction\-consistency, and seed\-stability rules under imbalance\-only intervention, but not under overlap\-only intervention\. A pattern will be categorized as *candidate overlap\-associated* using the analogous rule for overlap\-only intervention\. A pattern will be categorized as *shared* if it satisfies the effect\-size, direction\-consistency, and seed\-stability rules under both imbalance\-only and overlap\-only interventions, has the same direction of change, and has comparable magnitude\. We operationalize comparable magnitude as a ratio of absolute standardized effects between 0\.5 and 2\.0, i\.e\., neither effect is more than twice the other\. The 0\.5–2\.0 bounds are pre\-specified to exclude asymmetric effects where one issue drives the pattern substantially more than the other; no theoretical derivation is claimed\. For joint conditions, we compute an interaction residual by comparing the observed joint standardized change with the sum of the corresponding single\-issue changes:

$$\delta_{m} = d^{joint}_{m}(\alpha_{I}, \alpha_{O}) - \left[ d^{imb}_{m}(\alpha_{I}) + d^{ov}_{m}(\alpha_{O}) \right].$$

A pattern is categorized as *interaction\-dependent* when $|\delta_{m}| \geq 0.5$ and the direction\-consistency and seed\-stability rules are satisfied\. A pattern will be categorized as *masked/dominated* if a single\-issue pattern falls to $|d| < 0.2$ or changes direction when the other issue is introduced jointly\. In this study, an informative outcome need not contain many stable pattern categories\. A sparse or empty set of stable categories will be treated as evidence that the logged dynamics do not support robust issue\-associated interpretation under this protocol\. Frequent fidelity failures, strong non\-target drift, or unstable classifications will likewise be reported as evidence about protocol feasibility and limits\.
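The category rules can be summarized as a short rule-based classifier. This is a sketch under stated assumptions: the boolean inputs stand for "effect-size, direction-consistency, and seed-stability rules all satisfied" under each single-issue intervention, the masked/dominated check is read as comparing a single-issue effect with the joint-condition effect, and all function names and the "unclassified" fallback label are hypothetical.

```python
def interaction_residual(d_joint, d_imb, d_ov):
    """delta_m = d_joint - (d_imb + d_ov)."""
    return d_joint - (d_imb + d_ov)

def comparable_magnitude(d_imb, d_ov, low=0.5, high=2.0):
    """True if |d_imb| / |d_ov| lies within [low, high]."""
    if d_ov == 0:
        return False
    return low <= abs(d_imb) / abs(d_ov) <= high

def single_issue_category(imb_rules_ok, ov_rules_ok, d_imb, d_ov):
    """Assign a category from the two single-issue interventions."""
    if imb_rules_ok and not ov_rules_ok:
        return "candidate imbalance-associated"
    if ov_rules_ok and not imb_rules_ok:
        return "candidate overlap-associated"
    same_direction = d_imb * d_ov > 0
    if (imb_rules_ok and ov_rules_ok and same_direction
            and comparable_magnitude(d_imb, d_ov)):
        return "shared"
    return "unclassified"

def interaction_dependent(d_joint, d_imb, d_ov, rules_ok, floor=0.5):
    """True if |delta_m| >= floor and the consistency/stability rules hold."""
    return rules_ok and abs(interaction_residual(d_joint, d_imb, d_ov)) >= floor

def masked_or_dominated(d_single, d_joint, floor=0.2):
    """True if the effect drops below the floor or flips sign in the joint condition."""
    return abs(d_joint) < floor or d_single * d_joint < 0

print(single_issue_category(True, True, 0.6, 1.0))
print(interaction_dependent(1.8, 0.6, 0.5, rules_ok=True))
```

The residual treats the sum of single-issue effects as the additive baseline, so a value near zero indicates that the joint condition behaves roughly like the two issues stacked.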

## VI Threats to Validity

Construct validity\. The interventions are controlled stressors rather than natural reproductions of SDP data\-quality problems\. Imbalance injection may reflect artefacts of random oversampling, undersampling, and changed class exposure; progressive imbalance reduction may reflect SMOTE artefacts because SMOTE creates synthetic minority instances rather than recovering natural data\. We partly address these threats by reporting realized C2 values, fidelity checks, non\-target drift, coupling ratios, class counts, and training\-set sizes\. Overlap injection may create label–feature tension, so overlap\-associated patterns may partly reflect label–feature inconsistency rather than natural boundary ambiguity\. We partly address this by: \(i\) treating the intervention as a controlled stressor, not a natural reproduction; \(ii\) using progressive reduction from natural overlap as bidirectional triangulation; and \(iii\) testing sensitivity to $\gamma = 0.60$\.

Internal and conclusion validity\. Training dynamics may be influenced by mini\-batch composition, local density changes, training\-set size, stochastic optimization, and the fixed MLP architecture, not only by the targeted data\-quality issue\. N1 may also be unstable in small or high\-dimensional datasets because it relies on nearest\-neighbour structure\.

External validity and reproducibility\. The findings will be limited to metric\-based SDP datasets and one neural learner; they should therefore be interpreted as candidate patterns for this setting rather than universal symptoms of imbalance or overlap\. Results may not transfer to just\-in\-time defect prediction, process metrics, or code\-token models\. Reproducibility may also be affected by software versions, hardware, CUDA operations, and framework nondeterminism, so we will report package versions, device information, seeds, configurations, intervention logs, and realized data\-quality measurements\.

## VII Conclusion

This study proposes an interaction\-aware empirical protocol for examining how coupled data\-quality issues may shape neural training behaviour in software defect prediction\. The protocol systematically manipulates class imbalance and class overlap, both separately and jointly, while monitoring whether an intervention targeting one issue unintentionally changes the other\. In this way, the study moves beyond endpoint performance evaluation by examining training\-time behaviour, including gradients, weights, biases, and training/validation error trajectories\.

Rather than treating observed training\-dynamics patterns as issue\-specific by default, the protocol categorizes them as candidate imbalance\-associated, candidate overlap\-associated, shared, interaction\-dependent, or masked/dominated patterns\. The evidence from the final study is intended to provide a transparent and reproducible basis for understanding how coupled data\-quality issues affect neural software defect prediction models, and it may inform future diagnostic tools for software maintenance and evolution\.

## Declaration of AI Assistance

The authors used ChatGPT to support language editing\. The authors reviewed, revised, and take responsibility for all content\.

## References

- \[1\]K\. Bhandari, K\. Kumar, and A\. L\. Sangal\(2023\-08\)Data quality issues in software fault prediction: a systematic literature review\.Artificial Intelligence Review56\(8\),pp\. 7839–7908\(en\)\.External Links:ISSN 0269\-2821, 1573\-7462,[Link](https://link.springer.com/10.1007/s10462-022-10371-6),[Document](https://dx.doi.org/10.1007/s10462-022-10371-6)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p3.1),[§II\-C](https://arxiv.org/html/2606.24968#S2.SS3.p1.1)\.
- \[2\]B\. Caglayan, E\. Kocaguneli, J\. Krall, F\. Peters, and B\. Turhan\(2012\-01\)The promise repository of empirical software engineering data\.Cited by:[§II\-A](https://arxiv.org/html/2606.24968#S2.SS1.p1.1)\.
- \[3\]J\. Cao, M\. Li, X\. Chen, M\. Wen, Y\. Tian, B\. Wu, and S\. Cheung\(2022\)DeepFD: automated fault diagnosis and localization for deep learning programs\.InProceedings of the 44th International Conference on Software Engineering,ICSE ’22,New York, NY, USA,pp\. 573–585\.External Links:ISBN 9781450392211,[Link](https://doi.org/10.1145/3510003.3510099),[Document](https://dx.doi.org/10.1145/3510003.3510099)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p4.1)\.
- \[4\]N\. V\. Chawla, K\. W\. Bowyer, L\. O\. Hall, and W\. P\. Kegelmeyer\(2002\-06\-01\)SMOTE: synthetic minority over\-sampling technique\.16,pp\. 321–357\.External Links:ISSN 1076\-9757,[Link](https://www.jair.org/index.php/jair/article/view/10302),[Document](https://dx.doi.org/10.1613/jair.953)Cited by:[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1),[§V\-D](https://arxiv.org/html/2606.24968#S5.SS4.p2.6)\.
- \[5\]J\. Cohen\(1988\)Statistical power analysis for the behavioral sciences\.2nd ed edition,L\. Erlbaum Associates\.External Links:ISBN 978\-0\-8058\-0283\-2Cited by:[§V\-G](https://arxiv.org/html/2606.24968#S5.SS7.p1.2)\.
- \[6\]E\. C\. Dapaah and J\. Grabowski\(2025\)When data quality issues collide: a large\-scale empirical study of co\-occurring data quality issues in software defect prediction\.External Links:2512\.17460,[Link](https://arxiv.org/abs/2512.17460)Cited by:[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1)\.
- \[7\]J\. Eberlein, D\. Rodriguez, and R\. Harrison\(2025\-01\)The effect of data complexity on classifier performance\.Empirical Software Engineering30\(1\),pp\. 16\(en\)\.External Links:ISSN 1382\-3256, 1573\-7616,[Link](https://link.springer.com/10.1007/s10664-024-10554-5),[Document](https://dx.doi.org/10.1007/s10664-024-10554-5)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p3.1)\.
- \[8\]K\. J\. Eldho\(2022\-02\-16\)Impact of unbalanced classification on the performance of software defect prediction models\.15\(6\),pp\. 237–242\.External Links:ISSN 0974\-5645,[Link](https://indjst.org/),[Document](https://dx.doi.org/10.17485/IJST/v15i6.2193)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p2.1)\.
- \[9\]R\. Ferenc, Z\. Tóth, G\. Ladányi, I\. Siket, and T\. Gyimóthy\(2018\-10\)A Public Unified Bug Dataset for Java\.InProceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering,Oulu Finland,pp\. 12–21\(en\)\.External Links:ISBN 978\-1\-4503\-6593\-2,[Link](https://dl.acm.org/doi/10.1145/3273934.3273936),[Document](https://dx.doi.org/10.1145/3273934.3273936)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p5.1),[§II\-A](https://arxiv.org/html/2606.24968#S2.SS1.p1.1),[§IV\-A](https://arxiv.org/html/2606.24968#S4.SS1.p1.1)\.
- \[10\]\(2006\)Fundamentals of factorial designs\.InA Modern Theory of Factorial Designs,pp\. 9–48\.External Links:ISBN 978\-0\-387\-37344\-7,[Document](https://dx.doi.org/10.1007/0-387-37344-6%5F2),[Link](https://doi.org/10.1007/0-387-37344-6_2)Cited by:[§V\-C3](https://arxiv.org/html/2606.24968#S5.SS3.SSS3.p1.2)\.
- \[11\]L\. Gong, H\. Zhang, J\. Zhang, M\. Wei, and Z\. Huang\(2023\-04\)A Comprehensive Investigation of the Impact of Class Overlap on Software Defect Prediction\.IEEE Transactions on Software Engineering49\(4\),pp\. 2440–2458\.External Links:ISSN 1939\-3520,[Link](https://ieeexplore.ieee.org/document/9944157/),[Document](https://dx.doi.org/10.1109/TSE.2022.3220740)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p2.1),[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1)\.
- \[12\]T\. Hall, S\. Beecham, D\. Bowes, D\. Gray, and S\. Counsell\(2012\-11\)A Systematic Literature Review on Fault Prediction Performance in Software Engineering\.IEEE Transactions on Software Engineering38\(6\),pp\. 1276–1304\.External Links:ISSN 1939\-3520,[Link](https://ieeexplore.ieee.org/document/6035727/),[Document](https://dx.doi.org/10.1109/TSE.2011.103)Cited by:[§II\-A](https://arxiv.org/html/2606.24968#S2.SS1.p1.1)\.
- \[13\]M\. A\. Kabir, J\. W\. Keung, K\. E\. Bennin, and M\. Zhang\(2019\)Assessing the significant impact of concept drift in software defect prediction\.In2019 IEEE 43rd Annual Computer Software and Applications Conference \(COMPSAC\),Vol\.1,pp\. 53–58\.External Links:[Document](https://dx.doi.org/10.1109/COMPSAC.2019.00017)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p3.1)\.
- \[14\]Z\. Li, J\. Niu, and X\. Jing\(2024\-02\-27\)Software defect prediction: future directions and challenges\.31\(1\),pp\. 19\.External Links:ISSN 1573\-7535,[Link](https://doi.org/10.1007/s10515-024-00424-1),[Document](https://dx.doi.org/10.1007/s10515-024-00424-1)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p1.1)\.
- \[15\]Y\. Liu, W\. Zhang, G\. Qin, and J\. Zhao\(2022\)A comparative study on the effect of data imbalance on software defect prediction\.Procedia Computer Science214,pp\. 1603–1616\(en\)\.External Links:ISSN 18770509,[Link](https://linkinghub.elsevier.com/retrieve/pii/S1877050922020610),[Document](https://dx.doi.org/10.1016/j.procs.2022.11.349)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p2.1),[§I](https://arxiv.org/html/2606.24968#S1.p3.1)\.
- \[16\]A\. C\. Lorena, L\. P\. F\. Garcia, J\. Lehmann, M\. C\. P\. Souto, and T\. K\. Ho\(2019\-09\)How complex is your classification problem? a survey on measuring classification complexity\.ACM Comput\. Surv\.52\(5\)\.External Links:ISSN 0360\-0300,[Link](https://doi.org/10.1145/3347711),[Document](https://dx.doi.org/10.1145/3347711)Cited by:[§V\-A](https://arxiv.org/html/2606.24968#S5.SS1.p1.4),[§V\-A](https://arxiv.org/html/2606.24968#S5.SS1.p3.5)\.
- \[17\]R\. C\. Prati, G\. E\. A\. P\. A\. Batista, and M\. C\. Monard\(2004\)Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior\.InMICAI 2004: Advances in Artificial Intelligence,G\. Goos, J\. Hartmanis, J\. Van Leeuwen, R\. Monroy, G\. Arroyo\-Figueroa, L\. E\. Sucar, and H\. Sossa \(Eds\.\),Vol\.2972,pp\. 312–321\(en\)\.Note:Series Title: Lecture Notes in Computer ScienceExternal Links:ISBN 978\-3\-540\-21459\-5 978\-3\-540\-24694\-7,[Link](http://link.springer.com/10.1007/978-3-540-24694-7_32),[Document](https://dx.doi.org/10.1007/978-3-540-24694-7%5F32)Cited by:[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1)\.
- \[18\]M\. B\. Shah, M\. M\. Rahman, and F\. Khomh\(2025\-09\)Towards understanding the impact of data bugs on deep learning models in software engineering\.Empirical Softw\. Engg\.30\(6\)\.External Links:ISSN 1382\-3256,[Link](https://doi.org/10.1007/s10664-025-10717-y),[Document](https://dx.doi.org/10.1007/s10664-025-10717-y)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p4.1),[§II\-C](https://arxiv.org/html/2606.24968#S2.SS3.p1.1),[§V\-F](https://arxiv.org/html/2606.24968#S5.SS6.p1.1)\.
- \[19\]M\. Shepperd, D\. Bowes, and T\. Hall\(2014\)Researcher bias: the use of machine learning in software defect prediction\.IEEE Transactions on Software Engineering40\(6\),pp\. 603–616\.External Links:[Document](https://dx.doi.org/10.1109/TSE.2014.2322358)Cited by:[§II\-A](https://arxiv.org/html/2606.24968#S2.SS1.p1.1)\.
- \[20\]C\. Tantithamthavorn, A\. E\. Hassan, and K\. Matsumoto\(2020\)The impact of class rebalancing techniques on the performance and interpretation of defect prediction models\.IEEE Transactions on Software Engineering46\(11\),pp\. 1200–1219\.External Links:[Document](https://dx.doi.org/10.1109/TSE.2018.2876537)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p3.1),[§II\-A](https://arxiv.org/html/2606.24968#S2.SS1.p1.1),[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1),[§II\-C](https://arxiv.org/html/2606.24968#S2.SS3.p1.1)\.
- \[21\]I\. Tomek\(1976\)An experiment with the edited nearest\-neighbor rule\.IEEE Transactions on Systems, Man, and CyberneticsSMC\-6\(6\),pp\. 448–452\.External Links:[Document](https://dx.doi.org/10.1109/TSMC.1976.4309523)Cited by:[§V\-C2](https://arxiv.org/html/2606.24968#S5.SS3.SSS2.p1.1),[§V\-D](https://arxiv.org/html/2606.24968#S5.SS4.p3.2)\.
- \[22\]R\. Wang, F\. Liu, and Y\. Bai\(2024\)A software defect prediction method that simultaneously addresses class overlap and noise issues after oversampling\.Electronics13\(20\)\.External Links:[Link](https://www.mdpi.com/2079-9292/13/20/3976),ISSN 2079\-9292Cited by:[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1)\.
- \[23\]S\. Yen and Y\. Lee\(2006\)Under\-sampling approaches for improving prediction of the minority class in an imbalanced dataset\.InIntelligent Control and Automation: International Conference on Intelligent Computing, ICIC 2006 Kunming, China, August 16–19, 2006,D\. Huang, K\. Li, and G\. W\. Irwin \(Eds\.\),pp\. 731–740\.External Links:ISBN 978\-3\-540\-37256\-1,[Document](https://dx.doi.org/10.1007/978-3-540-37256-1%5F89),[Link](https://doi.org/10.1007/978-3-540-37256-1_89)Cited by:[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1)\.
- \[24\]Y\. Zhang, N\. Liu, Y\. Zhao, J\. Fan, and L\. Gong\(2026\)Evaluating the interactions between class overlap and class imbalance for software defect prediction\.Expert Systems with Applications296,pp\. 129067\.External Links:ISSN 0957\-4174,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.eswa.2025.129067),[Link](https://www.sciencedirect.com/science/article/pii/S0957417425026843)Cited by:[§I](https://arxiv.org/html/2606.24968#S1.p2.1),[§II\-B](https://arxiv.org/html/2606.24968#S2.SS2.p1.1)\.
