Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions
Summary
This paper introduces Counterfactual Explanation Consistency (CEC), a framework to detect and mitigate hidden procedural bias in outcome-fair models by aligning feature attributions between individuals and their counterfactual counterparts, with experiments on credit and income datasets.
# Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions
Source: [https://arxiv.org/html/2605.12701](https://arxiv.org/html/2605.12701)
###### Abstract
Machine learning algorithms in socially sensitive domains \(e\.g\., credit decisions\) often focus on equalizing predictive outcomes\. However, satisfying these metrics does not guarantee that models use the same reasoning for different groups\. We show that existing outcome\-fair models can still apply fundamentally different reasoning to individuals, a “hidden procedural bias” missed by standard fairness metrics and algorithms\. We propose Counterfactual Explanation Consistency \(CEC\), a framework that detects and mitigates this bias by aligning feature attributions between individuals and their counterfactual counterparts\. Key contributions include a nearest\-neighbor counterfactual generation method, a modified baseline for integrated gradient comparisons, an individual\-level procedural fairness metric, and a corresponding training loss\. We introduce a taxonomy identifying “Regime B” \(same outcome, different reasoning\) as a critical blind spot\. Experiments on synthetic data, German Credit, Adult Income, and HMDA mortgage data demonstrate that outcome\-fair baselines exhibit substantial hidden bias, while CEC substantially reduces it with modest utility cost\.
## Introduction
Machine learning \(ML\) models are now widely deployed in socially sensitive domains such as financial services for credit scoring, loan underwriting, and risk assessment\(Bello[2023](https://arxiv.org/html/2605.12701#bib.bib4); Dastileet al\.[2020](https://arxiv.org/html/2605.12701#bib.bib5)\)\. Regulatory frameworks such as the Equal Credit Opportunity Act \(ECOA\) and the Fair Housing Act prohibit lenders from disadvantaging individuals based on protected attributes, including race, gender, and ethnicity\(Act[2018](https://arxiv.org/html/2605.12701#bib.bib1),[1968](https://arxiv.org/html/2605.12701#bib.bib2)\)\. Ensuring that ML models comply with these requirements has become a central challenge for practitioners and regulators alike\.
Most algorithmic fairness research addresses this challenge through *outcome fairness*, which compares predictive outcomes across groups using criteria such as demographic parity or equalized odds (Hardt et al. [2016](https://arxiv.org/html/2605.12701#bib.bib28); Feldman et al. [2015](https://arxiv.org/html/2605.12701#bib.bib116)). While effective at reducing outcome disparities, these metrics leave a critical question unanswered: does the model rely on the same reasoning when evaluating individuals from different groups?
This question has deep roots in theories of procedural justice\. In the philosophical and legal traditions, fairness requires not only equitable outcomes but also consistent application of decision criteria\(Rawls[1971](https://arxiv.org/html/2605.12701#bib.bib175); Leventhal[1980](https://arxiv.org/html/2605.12701#bib.bib316); Thibautet al\.[1973](https://arxiv.org/html/2605.12701#bib.bib271)\)\. The principle that “like cases should be treated alike” demands that individuals with comparable qualifications be evaluated using the same standards, regardless of demographic membership\. In lending regulation, ECOA explicitly prohibits*disparate treatment*even when outcomes appear equal\(Act[2018](https://arxiv.org/html/2605.12701#bib.bib1)\)\. A lender who evaluates Male applicants primarily on credit score but evaluates Female applicants primarily on employment history has engaged in prohibited conduct, regardless of whether approval rates are balanced\. This legal and ethical framework motivates our focus on procedural fairness in automated decision systems\.
Figure[1](https://arxiv.org/html/2605.12701#Sx1.F1)illustrates the core problem\. Two loan applicants with nearly identical financial profiles but different protected groups both receive approval, satisfying outcome fairness\. However, examining the model’s procedure reveals different reasoning\. Applicant A’s decision is driven by credit score and income, while Applicant B’s decision depends primarily on employment history and collateral\. This*hidden procedural bias*\(different reasoning pathways producing the same outcome\) is invisible to most traditional fairness metrics yet constitutes disparate treatment under ECOA\.
[Figure 1 panels. Applicant A (Group 0): Income = \$60K, Score = 720, Approved; attributions: Credit Score 0.55, Income 0.28, Empl. History 0.10, Collateral 0.05, Other 0.02. Applicant B (Group 1): Income = \$62K, Score = 725, Approved; attributions: Credit Score 0.08, Income 0.02, Empl. History 0.48, Collateral 0.38, Other 0.04. Outcome fair (✓): same decision. Procedurally unfair (×): different reasoning.]

Figure 1: Two financially similar applicants from different demographic groups receive the same loan approval (outcome-fair), but the model's procedure reveals entirely different reasoning (procedurally unfair). This *Regime B* bias is invisible to standard fairness metrics.

We formalize this phenomenon through a taxonomy of four fairness regimes based on whether a model's predictions and explanations remain consistent under counterfactual changes to the protected attribute (Table [1](https://arxiv.org/html/2605.12701#Sx1.T1)). For an individual $(x, y, a)$ with counterfactual $\tilde{x}$ from the opposite group, we assess two properties: *prediction consistency* (does the classification remain the same?) and *explanation consistency* (does the model's reasoning remain the same?). The cross-product yields four regimes. Existing fairness metrics target Regimes C and D, where outcome disparities are observable. However, Regime B, in which predictions agree but reasoning differs, is equally problematic from both legal and ethical standpoints, yet remains invisible to all standard metrics.
Table 1: Taxonomy of fairness regimes. CEC targets Regime B, which is invisible to outcome-based methods.

To address this gap, we propose *Counterfactual Explanation Consistency* (CEC), a framework for detecting and mitigating hidden procedural bias. The core idea is to compare integrated gradient attribution vectors between a factual individual and a counterfactual counterpart. If the model is procedurally fair, these attribution vectors should be similar: the model should weigh credit score, income, and other financial features in the same proportions regardless of which group the applicant belongs to. When they diverge, CEC quantifies the degree of hidden bias and provides a differentiable training signal to mitigate it. In this paper, we make the following contributions:
- We formalize hidden procedural bias and introduce a 2×2 taxonomy distinguishing outcome vs. procedural fairness violations.
- We propose a nearest-neighbor counterfactual generation method that produces realistic matches by controlling for financial merit and creditworthiness outcome, without requiring causal graphs or structural equations.
- We propose *Counterfactual Explanation Consistency (CEC)*, a metric that measures explanation stability under demographic counterfactuals, along with a differentiable training loss that jointly optimizes accuracy, outcome fairness, and explanation consistency.
- We conduct a comprehensive evaluation across synthetic, benchmark, and real-world lending datasets demonstrating that outcome-fair baselines contain substantial hidden bias, which CEC substantially reduces with minimal utility cost.
## Related Work
### Outcome Fairness in Machine Learning
The algorithmic fairness literature has developed numerous statistical criteria for evaluating predictive equity\.*Demographic parity*requires equal positive prediction rates across groups\(Feldmanet al\.[2015](https://arxiv.org/html/2605.12701#bib.bib116)\);*equalized odds*requires equal true positive and false positive rates\(Hardtet al\.[2016](https://arxiv.org/html/2605.12701#bib.bib28)\); and*calibration*requires that predicted probabilities reflect true outcomes within each group\(Chouldechova[2017](https://arxiv.org/html/2605.12701#bib.bib27)\)\. These criteria are known to be mutually incompatible in general\(Kleinberget al\.[2017](https://arxiv.org/html/2605.12701#bib.bib149)\), leading to a rich literature on navigating trade\-offs\.
Mitigating bias spans the machine learning pipeline\. Pre\-processing methods transform training data to remove correlations with protected attributes\(Feldmanet al\.[2015](https://arxiv.org/html/2605.12701#bib.bib116); Kamiran and Calders[2012](https://arxiv.org/html/2605.12701#bib.bib62); Popoola and Sheppard[2024](https://arxiv.org/html/2605.12701#bib.bib281)\)\. In\-processing methods incorporate fairness constraints during training through reductions\(Agarwalet al\.[2018](https://arxiv.org/html/2605.12701#bib.bib89)\), adversarial objectives\(Zhanget al\.[2018](https://arxiv.org/html/2605.12701#bib.bib61)\), or constrained optimization\(Cotteret al\.[2019](https://arxiv.org/html/2605.12701#bib.bib317)\)\. Post\-processing methods adjust decision thresholds after training\(Hardtet al\.[2016](https://arxiv.org/html/2605.12701#bib.bib28)\)\. While these approaches effectively reduce outcome disparities, they evaluate only*what*a model predicts, not*how*it arrives at that prediction\. Our work shows that this gap allows models to harbor hidden procedural bias even when all outcome constraints are satisfied\.
### Explainability and Fairness
Post\-hoc explanation methods such as Local Interpretable Model\-Agnostic Explanations \(LIME\)\(Ribeiroet al\.[2016](https://arxiv.org/html/2605.12701#bib.bib35)\), SHapley Additive exPlanations \(SHAP\)\(Lundberg and Lee[2017](https://arxiv.org/html/2605.12701#bib.bib157)\), and integrated gradients \(IG\)\(Sundararajanet al\.[2017](https://arxiv.org/html/2605.12701#bib.bib189)\)have become essential tools for understanding model behavior in high\-stakes domains\. A growing body of work connects explainability with fairness\.[Daiet al\.](https://arxiv.org/html/2605.12701#bib.bib19)\([2022](https://arxiv.org/html/2605.12701#bib.bib19)\) examine whether explanation quality differs across demographic groups, finding that models may provide less informative explanations for minority groups\.[Begleyet al\.](https://arxiv.org/html/2605.12701#bib.bib37)\([2020](https://arxiv.org/html/2605.12701#bib.bib37)\) propose using feature importance to audit models for discriminatory patterns\.[Agarwalet al\.](https://arxiv.org/html/2605.12701#bib.bib56)\([2022](https://arxiv.org/html/2605.12701#bib.bib56)\) provide benchmarks for evaluating explanation methods, and[Slacket al\.](https://arxiv.org/html/2605.12701#bib.bib82)\([2020](https://arxiv.org/html/2605.12701#bib.bib82)\) show that explanations can be manipulated to hide bias\.
### Procedural Fairness
The concept of procedural fairness originates in social psychology and legal theory\.[Leventhal](https://arxiv.org/html/2605.12701#bib.bib316)\([1980](https://arxiv.org/html/2605.12701#bib.bib316)\) identified*consistency*as a core component of procedural justice in which the same decision rules should apply across persons and across time\.[Grgić\-Hlačaet al\.](https://arxiv.org/html/2605.12701#bib.bib128)\([2018](https://arxiv.org/html/2605.12701#bib.bib128)\) study which features humans consider fair to use in algorithmic decisions, finding strong preferences for process\-based criteria\. In the machine learning context, procedural fairness has received less attention than outcome fairness, though recent work has begun to formalize process\-based notions\(Zhaoet al\.[2023](https://arxiv.org/html/2605.12701#bib.bib207); Germinoet al\.[2025](https://arxiv.org/html/2605.12701#bib.bib86)\)\.[Dworket al\.](https://arxiv.org/html/2605.12701#bib.bib25)\([2012](https://arxiv.org/html/2605.12701#bib.bib25)\) proposed*individual fairness*\(similar individuals should receive similar outcomes\), which captures a related but distinct intuition\. Our method extends this notion by requiring not just similar outcomes but similar*reasoning processes*\.
### Counterfactual Fairness and Reasoning
Counterfactual reasoning provides a natural framework for individual\-level fairness analysis\.[Kusneret al\.](https://arxiv.org/html/2605.12701#bib.bib26)\([2017](https://arxiv.org/html/2605.12701#bib.bib26)\) formalize counterfactual fairness using structural causal models \(SCMs\), requiring that predictions remain unchanged under interventions on protected attributes\. Extensions address causal pathway constraints\(Wuet al\.[2019](https://arxiv.org/html/2605.12701#bib.bib233); Chiappa[2019](https://arxiv.org/html/2605.12701#bib.bib106)\)and relaxations for approximate fairness\. However, building such graphs accurately is difficult in practice\.
Counterfactual*explanation*methods such as DiCE\(Mothilalet al\.[2020](https://arxiv.org/html/2605.12701#bib.bib167)\), FACE\(Poyiadziet al\.[2020](https://arxiv.org/html/2605.12701#bib.bib174)\), and CARLA\(Pawelczyket al\.[2021](https://arxiv.org/html/2605.12701#bib.bib58)\)generate alternative inputs that would change a prediction, focusing on*recourse*\(how can an individual obtain a different outcome?\) rather than fairness auditing\. We use counterfactual reasoning to assess whether*explanations*remain consistent across demographic groups, and we achieve this without requiring causal graphs through merit\-based nearest\-neighbor matching\.
### Multi\-Objective Fair Learning
Fair classification can be regarded as multi\-objective, involving potential trade\-offs between accuracy and one or more fairness constraints\(Wanget al\.[2024](https://arxiv.org/html/2605.12701#bib.bib49); Cotteret al\.[2019](https://arxiv.org/html/2605.12701#bib.bib317)\)\. Existing methods typically balance two objectives, which are predictive performance and outcome fairness\(Wei and Niethammer[2022](https://arxiv.org/html/2605.12701#bib.bib204); Nagpalet al\.[2025](https://arxiv.org/html/2605.12701#bib.bib306)\)\. Our training objective extends this to a three\-dimensional trade\-off by introducing explanation consistency as an additional objective, demonstrating that procedural fairness can be achieved jointly with outcome fairness and accuracy at modest cost\.
## Methodology
### Notation and Problem Setup
We consider binary classification in a high-stakes, finance-based decision domain such as credit lending. Let $\mathcal{X} \subseteq \mathbb{R}^{d}$ denote the feature space, $\mathcal{Y} = \{0,1\}$ the label space (e.g., $y = 1$ for loan approval), and $a \in \mathcal{A} = \{0,1\}$ a binary protected attribute (e.g., race or gender). Given training data $\mathcal{D} = \{(x_i, y_i, a_i)\}_{i=1}^{n}$, we learn a scoring function $f_\theta : \mathcal{X} \to \mathbb{R}$ parameterized by $\theta$, with predicted label $\hat{y} = \mathbb{I}[f_\theta(x) \geq \tau]$ for some threshold $\tau$.
A key part of our framework is the distinction between financial and non\-financial features\.
###### Definition 1\(Financial Feature Set\)\.
Let $\mathcal{F} \subseteq \{1, \ldots, d\}$ denote the indices of *financial features*. A feature is regarded as a financial feature iff it satisfies three criteria:
1. Merit-based: it reflects creditworthiness or repayment ability (e.g., income, credit score, debt-to-income ratio),
2. Legally permissible: it is not prohibited by fair lending regulation, and
3. Not a demographic proxy: it is not a strong proxy for a protected attribute (e.g., residential zip code is excluded due to its correlation with race from historical redlining).

In practice, $\mathcal{F}$ is determined through domain expertise and regulatory guidance. For credit lending, typical members include income, credit score, length of credit history, debt-to-income ratio, employment length, and liquid assets. Features typically excluded include residential address, educational institution, and certain occupation categories that may be demographically imbalanced.

Our method, CEC, ensures that $f_\theta$ uses consistent reasoning across demographic groups for individuals who are comparable on $\mathcal{F}$. The framework consists of three components: (1) counterfactual generation, (2) consistent baseline selection, and (3) the CEC metric with its training loss.
### Counterfactual Generation
#### The Problem with Naive Counterfactuals\.
A naive approach to generating counterfactuals simply flips the protected attribute while holding all other features constant. Given an individual $x$ with protected attribute $a$, the naive counterfactual is $\tilde{x}_{\text{naive}} = (x_1, \ldots, x_{j-1}, 1-a, x_{j+1}, \ldots, x_d)$, where $j$ indexes the protected attribute. This approach has a critical flaw: many features are correlated with protected attributes due to historical and structural factors. Residential zip code correlates with race due to segregation, certain employment sectors are gender-imbalanced, and educational backgrounds track socioeconomic status; all of these are nominally non-demographic features that nevertheless act as proxies for protected attributes. Flipping race while holding zip code constant can therefore produce an individual who is statistically implausible, meaning the combination is rarely or never observed in the real population.
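The flaw described above can be made concrete with a small sketch. The toy rows, the column layout (`[zip_region, protected_attribute]`), and the helper names `naive_counterfactual` and `support_count` are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def naive_counterfactual(x, j):
    """Flip the binary protected attribute at index j, leaving all else fixed."""
    x_cf = list(x)
    x_cf[j] = 1 - x_cf[j]
    return x_cf


def support_count(x, data):
    """Count rows in data that are identical to x."""
    return sum(1 for row in data if list(row) == list(x))


# Hypothetical toy rows: [zip_region, protected_attribute].
# Region 0 contains only group 0, mimicking a segregated feature.
data = [[0, 0], [0, 0], [0, 0], [1, 1], [1, 1], [1, 0]]

x = [0, 0]
x_cf = naive_counterfactual(x, j=1)

factual_support = support_count(x, data)    # rows matching the real individual
flipped_support = support_count(x_cf, data)  # rows matching the flipped one
```

In this toy data the factual row is well supported while the flipped row never occurs, which is the kind of off-distribution input the paragraph warns about.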
#### Label\-Stratified Nearest\-Neighbor Matching\.
We propose a novel approach that avoids causal modeling while producing realistic and interpretable counterfactuals\. The key insight is that for fairness auditing in lending, we need only identify individuals from different demographic groups who are comparable in terms of financial features that*should*legitimately determine creditworthiness\.
We approximate counterfactuals using matched opposite\-group instances with similar financial characteristics\. This follows standard practice in counterfactual fairness auditing when structural causal models are unavailable\(Schwabet al\.[2018](https://arxiv.org/html/2605.12701#bib.bib90)\)\. We do not claim causal counterfactual validity\. Rather, we use matched proxies to approximate demographic interventions while preserving plausibility in financial features\.
###### Definition 2\(Counterfactual Matching\)\.
Given an individual $(x, y, a)$, financial feature set $\mathcal{F}$, and training data $\mathcal{D}$, a counterfactual is:

$$\tilde{x} = x_{i^*}, \qquad i^* = \operatorname*{argmin}_{i:\, a_i = 1-a,\; y_i = y} \left\| x_{\mathcal{F}}^{\text{std}} - (x_i)_{\mathcal{F}}^{\text{std}} \right\|_2$$

where $x_{\mathcal{F}}^{\text{std}}$ denotes the standardized subvector of $x$ restricted to $\mathcal{F}$.
Stratifying on the true label $y$ is an important design choice: it ensures that the factual and counterfactual share the same observed creditworthiness label, so any measured explanation difference reflects group-dependent reasoning rather than legitimate risk differences. We note that observed labels may not perfectly capture ground-truth creditworthiness due to selective labeling (only approved applicants have observed repayment outcomes). This limitation is common across credit decision modeling. The counterfactual $\tilde{x}$ is always a *real individual* from the training data, not a synthetic example, which ensures distributional plausibility. Algorithm [1](https://arxiv.org/html/2605.12701#alg1) and Figure [2](https://arxiv.org/html/2605.12701#Sx3.F2) show this procedure with quality safeguards.
Algorithm 1: Label-Stratified Counterfactual Generation

Input: training data $\mathcal{D} = \{(x_i, y_i, a_i)\}_{i=1}^{n}$, financial features $\mathcal{F}$, distance threshold $\tau$

Output: counterfactual map $\mathcal{C} : i \mapsto \tilde{x}_i$

1. // Preprocessing (one-time)
2. Compute standardization: $\mu_{\mathcal{F}}, \sigma_{\mathcal{F}}$ from training data
3. for $(y, a) \in \{0,1\} \times \{0,1\}$ do
4. $\mathcal{D}_{y,a} \leftarrow \{(x_i, i) : y_i = y,\, a_i = a\}$
5. if $|\mathcal{D}_{y,a}| > 0$ then
6. Standardize: $X_{y,a}^{\mathcal{F}} \leftarrow \{(x_i)_{\mathcal{F}}^{\text{std}}\}$
7. Build KD-tree index $\mathcal{I}_{y,a}$ on $X_{y,a}^{\mathcal{F}}$
8. end if
9. end for
10. // Query (for each instance)
11. for $i = 1, \ldots, n$ do
12. $\bar{a} \leftarrow 1 - a_i$ {Opposite group}
13. if $|\mathcal{D}_{y_i, \bar{a}}| = 0$ then
14. Mark $i$ as unmatched; continue
15. end if
16. $(j^*, d^*) \leftarrow \mathcal{I}_{y_i, \bar{a}}.\text{query}\big((x_i)_{\mathcal{F}}^{\text{std}}\big)$
17. if $\tau > 0$ and $d^* > \tau$ then
18. Mark $i$ as unmatched {Poor quality}
19. else
20. $\mathcal{C}[i] \leftarrow x_{j^*}$ {Full feature vector}
21. end if
22. end for
The algorithm operates in two phases. The *preprocessing* phase partitions the training data into four subsets by $(y, a)$, standardizes financial features using training-set statistics to ensure scale-invariant distance computation, and builds a KD-tree index for each non-empty partition. The complexity is $O(n \log n)$ per subset. The *querying* phase finds the nearest neighbor in the opposite-group, same-label partition for each instance, with $O(|\mathcal{F}| \log n)$ cost per query. The additional distance threshold $\tau$ rejects poor-quality matches when no suitable counterpart exists.
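The two phases can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it swaps the KD-tree for a brute-force linear scan (same result, worse complexity), treats all supplied rows as the training set, standardizes with the population standard deviation, and uses hypothetical names (`standardize`, `match_counterfactuals`) and toy data.

```python
import math
from statistics import mean, pstdev


def standardize(rows, fin_idx):
    """Z-score the financial columns using statistics of the given rows."""
    stats = {}
    for j in fin_idx:
        col = [r[j] for r in rows]
        sd = pstdev(col)
        stats[j] = (mean(col), sd if sd > 0 else 1.0)
    return [[(r[j] - stats[j][0]) / stats[j][1] for j in fin_idx] for r in rows]


def match_counterfactuals(X, y, a, fin_idx, tau=0.0):
    """Return {i: index of matched counterfactual}; unmatched i are omitted."""
    Z = standardize(X, fin_idx)
    # Preprocessing: partition indices by (label, group).
    parts = {}
    for i, (yi, ai) in enumerate(zip(y, a)):
        parts.setdefault((yi, ai), []).append(i)
    matches = {}
    # Query: nearest same-label, opposite-group neighbor on financial features.
    for i in range(len(X)):
        pool = parts.get((y[i], 1 - a[i]), [])
        if not pool:
            continue  # no candidate partition: unmatched
        j_star = min(pool, key=lambda j: math.dist(Z[i], Z[j]))
        d_star = math.dist(Z[i], Z[j_star])
        if tau > 0 and d_star > tau:
            continue  # poor-quality match: unmatched
        matches[i] = j_star
    return matches


# Hypothetical toy data: columns are [income in $K, credit score].
X = [[60, 720], [62, 725], [30, 580], [90, 800], [31, 590]]
y = [1, 1, 0, 1, 0]
a = [0, 1, 0, 1, 1]

matches = match_counterfactuals(X, y, a, fin_idx=[0, 1])
strict_matches = match_counterfactuals(X, y, a, fin_idx=[0, 1], tau=0.5)
```

With no threshold every instance finds a partner; with `tau=0.5` the high-income instance at index 3 is left unmatched because its only same-label, opposite-group candidate is far away in standardized space. For realistic data sizes, a spatial index such as a KD-tree would replace the linear scan.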
[Figure 2 flow, "Labeled Stratified NN Matching": Training Data $\{(x_i, y_i, a_i)\}$ → Partition by $(y, a)$ → Build KD-tree $\mathcal{I}_{y,a}$ on $\mathcal{F}$ and compute label-group baselines $b_{y,a}$ → NN Matching $\tilde{x}_i \leftarrow$ query $\mathcal{I}_{y_i, 1-a_i}$ → Paired Dataset $\{(x_i, \tilde{x}_i, y_i, a_i, b_{y_i,a_i})\}_{i=1}^{n}$.]

Figure 2: Training data is partitioned by label and group. KD-tree indices enable efficient nearest-neighbor matching on financial features $\mathcal{F}$, producing paired data with label-group baselines for training.
### The Consistent Baseline Principle
#### Integrated Gradients Background\.
Integrated gradients (IG) (Sundararajan et al. [2017](https://arxiv.org/html/2605.12701#bib.bib189)) compute feature attributions by accumulating gradients along a straight-line path from a baseline $b$ to the input $x$:

$$\text{IG}_j(x; b) = (x_j - b_j) \int_0^1 \frac{\partial f}{\partial x_j}\bigg|_{b + \alpha(x - b)} \, d\alpha$$

IG satisfies two desirable axioms: *completeness* (attributions sum to $f(x) - f(b)$) and *sensitivity* (if changing feature $j$ changes the prediction, it receives a nonzero attribution). The choice of baseline $b$ determines the reference point from which feature importance is measured.
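As a rough illustration of the definition, the sketch below approximates the path integral with a midpoint Riemann sum and checks the completeness axiom numerically. The logistic scorer, its weights, the input, the zero baseline, and the step count are assumptions chosen for the example; the paper does not prescribe them.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_logistic(w, c):
    """Hypothetical scorer f(x) = sigmoid(w . x + c) with its analytic gradient."""
    def f(x):
        return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + c)

    def grad(x):
        p = f(x)
        return [p * (1.0 - p) * wj for wj in w]

    return f, grad


def integrated_gradients(grad, x, b, steps=200):
    """Midpoint Riemann-sum approximation of IG_j(x; b) for every feature j."""
    d = len(x)
    total = [0.0] * d
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [bj + alpha * (xj - bj) for xj, bj in zip(x, b)]
        g = grad(point)
        for j in range(d):
            total[j] += g[j]
    # Average gradient along the path, scaled by the input-baseline difference.
    return [(x[j] - b[j]) * total[j] / steps for j in range(d)]


f, grad = make_logistic(w=[0.8, -0.5, 0.3], c=-0.2)
x = [1.2, 0.4, -0.7]
b = [0.0, 0.0, 0.0]

ig = integrated_gradients(grad, x, b)
# Completeness: attributions should sum to f(x) - f(b) up to discretization error.
completeness_gap = abs(sum(ig) - (f(x) - f(b)))
```

The gap shrinks as `steps` grows, so it doubles as a cheap sanity check on the number of integration steps.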
#### The Confound of Group\-Specific Baselines\.
When comparing explanations across groups for a factual–counterfactual pair $(x, \tilde{x})$, a natural choice is to use group-specific baselines: $b_0$ for Group 0 and $b_1$ for Group 1. However, this confounds two distinct effects. Suppose $x$ has income \$60K and belongs to Group 0 (average income $b_0^{\text{inc}} = \$50\text{K}$), while $\tilde{x}$ has income \$62K and belongs to Group 1 (average income $b_1^{\text{inc}} = \$55\text{K}$). Even if the model weighs income identically for both groups, the attributions differ: $\text{IG}_{\text{inc}}(x; b_0) \propto (60 - 50) = 10$ versus $\text{IG}_{\text{inc}}(\tilde{x}; b_1) \propto (62 - 55) = 7$. The difference of $3$ is an artifact of the different baselines, not of discriminatory reasoning, which makes the comparison invalid as a fairness measure.
###### Principle 1\(Consistent Baseline\)\.
For a factual–counterfactual pair $(x, \tilde{x})$ where $x$ has label $y$ and attribute $a$, both attributions must use the same baseline:

$$\text{IG}(x;\, b_{y,a}) \quad \text{and} \quad \text{IG}(\tilde{x};\, b_{y,a})$$

where $b_{y,a} = \frac{1}{|\{i : y_i = y,\, a_i = a\}|} \sum_{i:\, y_i = y,\, a_i = a} x_i$ is the label-group mean for the factual's label and group.
Using the factual's group baseline, the example becomes: $\text{IG}_{\text{inc}}(x; b_0) \propto (60 - 50) = 10$ and $\text{IG}_{\text{inc}}(\tilde{x}; b_0) \propto (62 - 50) = 12$. The difference of $2$ now reflects the genuine input difference, with no baseline-induced confounder. Any disproportionate attribution change beyond what the input difference warrants reveals discriminatory model behavior. We formalize this guarantee as follows.
###### Theorem 1\(Baseline Consistency Isolates Discrimination\)\.
Let $f$ be a differentiable function and $(x, \tilde{x})$ a pair with consistent baseline $b$. If the model's gradients on non-protected features are approximately constant along both integration paths (i.e., $\frac{\partial f}{\partial x_k}\big|_{b + \alpha(x - b)} \approx \frac{\partial f}{\partial x_k}\big|_{x}$ for $\alpha \in [0,1]$), then for non-protected feature $k$:

$$\text{IG}_k(x; b) - \text{IG}_k(\tilde{x}; b) \approx (x_k - \tilde{x}_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x}$$
###### Proof\.
By the definition of integrated gradients, for a non-protected feature $k$:

$$\begin{aligned}
\text{IG}_k(x; b) &= (x_k - b_k) \int_0^1 \frac{\partial f}{\partial x_k}\bigg|_{b + \alpha(x - b)} \, d\alpha \\
\text{IG}_k(\tilde{x}; b) &= (\tilde{x}_k - b_k) \int_0^1 \frac{\partial f}{\partial x_k}\bigg|_{b + \alpha(\tilde{x} - b)} \, d\alpha
\end{aligned}$$

Under the assumption that gradients are approximately constant along both integration paths, i.e., $\frac{\partial f}{\partial x_k}\big|_{b + \alpha(x - b)} \approx \frac{\partial f}{\partial x_k}\big|_{x}$ and $\frac{\partial f}{\partial x_k}\big|_{b + \alpha(\tilde{x} - b)} \approx \frac{\partial f}{\partial x_k}\big|_{x}$ for all $\alpha \in [0,1]$, the integrals collapse:

$$\begin{aligned}
\text{IG}_k(x; b) &\approx (x_k - b_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x} \\
\text{IG}_k(\tilde{x}; b) &\approx (\tilde{x}_k - b_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x}
\end{aligned}$$

Subtracting yields:

$$\text{IG}_k(x; b) - \text{IG}_k(\tilde{x}; b) \approx (x_k - \tilde{x}_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x}$$

Note that the baseline $b$ cancels entirely in the subtraction. In contrast, with group-specific baselines $b^{a}$ and $b^{1-a}$, the same derivation yields:

$$\begin{aligned}
\text{IG}_k(x; b^{a}) - \text{IG}_k(\tilde{x}; b^{1-a}) &\approx (x_k - b^{a}_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x} - (\tilde{x}_k - b^{1-a}_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x} \\
&= (x_k - \tilde{x}_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x} + (b^{1-a}_k - b^{a}_k) \cdot \frac{\partial f}{\partial x_k}\bigg|_{x}.
\end{aligned}$$

This introduces a spurious term $(b^{1-a}_k - b^{a}_k) \cdot \frac{\partial f}{\partial x_k}\big|_{x}$ that depends on group distribution differences rather than model discrimination. ∎
Remark. Theorem [1](https://arxiv.org/html/2605.12701#Thmtheorem1) is stated for the idealized case of approximately constant gradients; in practice, the nearest-neighbor counterfactuals from Definition [2](https://arxiv.org/html/2605.12701#Thmdefinition2) differ on multiple features, not just the protected attribute. The approximation tightens as match quality improves (smaller $\|x_{\mathcal{F}} - \tilde{x}_{\mathcal{F}}\|$), which we monitor via the coverage and distance metrics reported by Algorithm [1](https://arxiv.org/html/2605.12701#alg1). The key guarantee is that attribution differences depend on input differences $(x_k - \tilde{x}_k)$ and model gradients, with no baseline-induced confounder. In contrast, group-specific baselines introduce an additional term proportional to $(b_{y,1-a} - b_{y,a})$ that conflates demographic distribution differences with discriminatory reasoning.
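The income example and the decomposition in the proof can be checked with simple arithmetic. The sketch below assumes a linear score $f(x) = w \cdot x$, for which IG is exactly $w_j (x_j - b_j)$ and the constant-gradient assumption holds with equality; the unit weight on income and the helper name `ig_linear` are hypothetical.

```python
def ig_linear(w, x, b):
    """Exact IG for a linear score f(x) = sum_j w[j] * x[j]."""
    return [wj * (xj - bj) for wj, xj, bj in zip(w, x, b)]


w = [1.0]                  # hypothetical weight on income (in $K)
x, x_cf = [60.0], [62.0]   # factual (Group 0) and counterfactual (Group 1)
b0, b1 = [50.0], [55.0]    # group mean incomes from the running example

# Group-specific baselines: each attribution uses its own group's mean.
group_specific_gap = ig_linear(w, x, b0)[0] - ig_linear(w, x_cf, b1)[0]

# Consistent baseline: both attributions use the factual's group mean.
consistent_gap = ig_linear(w, x, b0)[0] - ig_linear(w, x_cf, b0)[0]

# What remains is the baseline-induced term w * (b1 - b0).
baseline_artifact = group_specific_gap - consistent_gap
```

Here the consistent-baseline gap equals $w (x - \tilde{x})$, and the extra amount picked up by group-specific baselines equals $w (b_1 - b_0)$, matching the spurious term in the proof.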
### The CEC Metric and Theoretical Properties
#### Normalization\.
IG vectors may have vastly different magnitudes across individuals. To compare explanation *direction* rather than magnitude, we normalize:

$$\text{IG}_{\text{norm}}(x; b) = \frac{\text{IG}(x; b)}{\|\text{IG}(x; b)\|_2 + \epsilon}$$

where $\epsilon = 10^{-8}$ prevents division by zero.
#### The CEC Score\.
###### Definition 3\(CEC Score\)\.
For a factual–counterfactual pair $(x, \tilde{x})$ with consistent baseline $b = b_{y,a}$, the raw L2 distance between normalized attribution vectors is:

$$\Delta_{\text{raw}}(x, \tilde{x}) = \left\| \text{IG}_{\text{norm}}(x; b) - \text{IG}_{\text{norm}}(\tilde{x}; b) \right\|_2$$

Since both vectors have unit norm, $\Delta_{\text{raw}} \in [0, 2]$. We normalize to obtain a score in $[0, 1]$:

$$\Delta_{\text{CEC}}(x, \tilde{x}) = \frac{\Delta_{\text{raw}}(x, \tilde{x})}{2}$$

where division by $2$ (the maximum L2 distance between unit vectors) maps the score to $[0, 1]$. The population-level CEC is:

$$\text{CEC}(f; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \Delta_{\text{CEC}}(x_i, \tilde{x}_i)$$
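A minimal sketch of the score follows, assuming the attribution vectors have already been computed against a consistent baseline. The function names are hypothetical, and the two example vectors simply reuse the illustrative attribution values shown in Figure 1 (credit score, income, employment history, collateral, other); they are not experimental results.

```python
import math

EPS = 1e-8  # matches the epsilon used in the normalization step


def normalize(v, eps=EPS):
    """L2-normalize an attribution vector, guarding against a zero norm."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / (norm + eps) for c in v]


def delta_cec(ig_x, ig_cf):
    """Half the L2 distance between the two normalized attribution vectors."""
    return math.dist(normalize(ig_x), normalize(ig_cf)) / 2.0


def population_cec(pairs):
    """Mean of delta_cec over (factual, counterfactual) attribution pairs."""
    return sum(delta_cec(u, v) for u, v in pairs) / len(pairs)


# Illustrative attributions for Applicants A and B from Figure 1.
ig_a = [0.55, 0.28, 0.10, 0.05, 0.02]
ig_b = [0.08, 0.02, 0.48, 0.38, 0.04]

score = delta_cec(ig_a, ig_b)
pop = population_cec([(ig_a, ig_b), (ig_a, ig_a)])
```

For these two vectors the score comes out at roughly 0.59, well away from the value of 0 that identical reasoning would give. Because of the normalization, rescaling an attribution vector by a positive constant leaves the score essentially unchanged.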
#### Theoretical Properties\.
The CEC metric admits a geometric interpretation: each normalized attribution vector lies on the unit hypersphere in $\mathbb{R}^{d}$. The raw distance $\Delta_{\text{raw}}$ measures the chord length between two points on this hypersphere, related to angular separation by $\Delta_{\text{raw}} = \sqrt{2(1 - \cos\theta)}$. The normalization by $2$ converts this to a $[0, 1]$ scale where the score equals $\frac{1}{2}\sqrt{2(1 - \cos\theta)}$, providing an intuitive interpretation: $0$ means identical explanations ($\theta = 0$) and $1$ means maximally opposed explanations ($\theta = \pi$).
###### Proposition 1\(Bounded Range\)\.
$0\leq\Delta_{\text{CEC}}(x,\tilde{x})\leq 1$. The score is $0$ when explanations are identical (attribution vectors are parallel) and $1$ when they are maximally opposed (antipodal on the unit hypersphere).
###### Proof\.
Let $g=\text{IG}_{\text{norm}}(x;b)$ and $\tilde{g}=\text{IG}_{\text{norm}}(\tilde{x};b)$. By construction, $\|g\|_{2}=\|\tilde{g}\|_{2}=1$. Expanding the squared norm of their difference:
$$\|g-\tilde{g}\|_{2}^{2}=\|g\|_{2}^{2}-2\,g^{\top}\tilde{g}+\|\tilde{g}\|_{2}^{2}=2-2\cos\theta$$
where $\theta\in[0,\pi]$ is the angle between $g$ and $\tilde{g}$. Since $\cos\theta\in[-1,1]$, we have $\|g-\tilde{g}\|_{2}^{2}\in[0,4]$ and thus $\Delta_{\text{raw}}=\|g-\tilde{g}\|_{2}\in[0,2]$. Division by $2$ yields $\Delta_{\text{CEC}}=\Delta_{\text{raw}}/2\in[0,1]$.
The lower bound $\Delta_{\text{CEC}}=0$ is attained when $\theta=0$, i.e., $g=\tilde{g}$ (identical attribution directions). The upper bound $\Delta_{\text{CEC}}=1$ is attained when $\theta=\pi$, i.e., $g=-\tilde{g}$ (antipodal vectors on the unit hypersphere, indicating maximally opposed explanations). ∎
###### Proposition 2\(Sensitivity to Discrimination\)\.
If the model weighs feature $k$ differently across groups (i.e., $\frac{\partial f}{\partial x_{k}}\big|_{x}\neq\frac{\partial f}{\partial x_{k}}\big|_{\tilde{x}}$), then $\Delta_{\text{CEC}}(x,\tilde{x})>0$ for generic model parameterizations (exact cancellation by opposing changes in other features is non-generic).
###### Proof\.
Let $g=\text{IG}_{\text{norm}}(x;b)$ and $\tilde{g}=\text{IG}_{\text{norm}}(\tilde{x};b)$ denote the normalized attribution vectors. Since $\|g\|_{2}=\|\tilde{g}\|_{2}=1$, we have $\Delta_{\text{CEC}}=\frac{1}{2}\|g-\tilde{g}\|_{2}=0$ if and only if $g=\tilde{g}$, i.e., the normalized attribution vectors are identical. Suppose $\frac{\partial f}{\partial x_{k}}\big|_{x}\neq\frac{\partial f}{\partial x_{k}}\big|_{\tilde{x}}$ for some feature $k$. By the IG formula, this gradient difference propagates into the unnormalized attributions: $\text{IG}_{k}(x;b)$ depends on $\frac{\partial f}{\partial x_{k}}$ evaluated along the path from $b$ to $x$, while $\text{IG}_{k}(\tilde{x};b)$ depends on $\frac{\partial f}{\partial x_{k}}$ evaluated along the path from $b$ to $\tilde{x}$. Since the model weighs feature $k$ differently at $x$ versus $\tilde{x}$, the unnormalized vectors $\text{IG}(x;b)$ and $\text{IG}(\tilde{x};b)$ differ in at least the $k$-th component. After $L_{2}$ normalization, $g=\tilde{g}$ would require all component-wise ratios to be equal: $\frac{\text{IG}_{j}(x;b)}{\text{IG}_{j}(\tilde{x};b)}=c$ for all $j$ and some constant $c>0$. This is a system of $d-1$ independent constraints on the model parameters. Since the model parameter space is continuous, the set of parameters that satisfy all $d-1$ constraints simultaneously has measure zero (it defines a manifold of codimension $d-1$). Thus, for generic model parameterizations, $g\neq\tilde{g}$ and $\Delta_{\text{CEC}}>0$. ∎
These properties ensure that CEC reliably detects procedural discrimination whenever the model applies different feature weights to different groups. We note that for a model that treats all groups identically, residual $\Delta_{\text{CEC}}>0$ may still arise from input differences between $x$ and $\tilde{x}$; however, this residual is bounded by match quality and does not reflect discriminatory reasoning.
Figure 3: Phase 2 (Training): Each minibatch passes through the model to compute predictions and integrated gradients for both factual and counterfactual inputs (using the same baseline). Three losses (prediction accuracy, equalized odds, and explanation consistency) are combined and backpropagated.
### Training Objective
We incorporate CEC as a differentiable regularizer alongside prediction and outcome fairness losses as shown in Figure[3](https://arxiv.org/html/2605.12701#Sx3.F3)and Algorithm[2](https://arxiv.org/html/2605.12701#alg2):
$$\min_{\theta}\;\;\mathcal{L}_{\text{pred}}(\theta)+\lambda_{\text{EO}}\,\mathcal{L}_{\text{EO}}(\theta)+\lambda_{\text{CEC}}\,\mathcal{L}_{\text{CEC}}(\theta)$$
where $\mathcal{L}_{\text{pred}}$ is the binary cross-entropy loss:
$$\mathcal{L}_{\text{pred}}=-\frac{1}{m}\sum_{i=1}^{m}\left[y_{i}\log\sigma(f_{\theta}(x_{i}))+(1-y_{i})\log(1-\sigma(f_{\theta}(x_{i})))\right]$$
$\mathcal{L}_{\text{EO}}$ enforces equalized odds using differentiable soft rates:
$$\mathcal{L}_{\text{EO}}=(\widehat{\text{TPR}}_{0}-\widehat{\text{TPR}}_{1})^{2}+(\widehat{\text{FPR}}_{0}-\widehat{\text{FPR}}_{1})^{2}$$
where
$$\widehat{\text{TPR}}_{g}=\frac{\sum_{i:y_{i}=1,a_{i}=g}\sigma(f_{\theta}(x_{i}))}{\sum_{i:y_{i}=1,a_{i}=g}1}\quad\text{and}\quad\widehat{\text{FPR}}_{g}=\frac{\sum_{i:y_{i}=0,a_{i}=g}\sigma(f_{\theta}(x_{i}))}{\sum_{i:y_{i}=0,a_{i}=g}1}$$
replace hard predictions with sigmoid outputs $\sigma(f_{\theta}(x))$ to ensure differentiability. Finally, $\mathcal{L}_{\text{CEC}}$ enforces explanation consistency:
$$\mathcal{L}_{\text{CEC}}=\frac{1}{m}\sum_{i=1}^{m}\left(\frac{\left\|\text{IG}_{\text{norm}}(x_{i};b_{y_{i},a_{i}})-\text{IG}_{\text{norm}}(\tilde{x}_{i};b_{y_{i},a_{i}})\right\|_{2}}{2}\right)^{2}$$
We square the normalized CEC scores to obtain a differentiable objective that penalizes large deviations more heavily than small ones. Since individual CEC scores lie in $[0,1]$, the loss is bounded in $[0,1]$, providing a consistent scale relative to the other loss terms.
The three terms balance utility (accurate predictions), outcome fairness (equalized error rates), and procedural fairness (consistent reasoning). An important distinction is that these are not redundant because $\mathcal{L}_{\text{EO}}\approx 0$ does not imply $\mathcal{L}_{\text{CEC}}\approx 0$. Models can achieve perfect equalized odds while using entirely different feature weightings for different groups (Regime B in our taxonomy). Conversely, explanation consistency alone does not guarantee outcome fairness. The joint objective encourages both to be satisfied simultaneously. The CEC loss is differentiable through the integrated gradients computation via standard backpropagation, and its $[0,1]$ range keeps its scale comparable to that of the other loss components.
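The way the three terms combine can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' code: the function names, the toy logits, and the per-pair CEC scores are invented here, and the handling of an empty label-group cell (rate 0.0) is an assumption the text does not specify. In practice the same expressions would be written in an autodiff framework so that gradients flow to $\theta$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_rate(probs, labels, groups, label, group):
    """Mean predicted probability over samples with the given label and group."""
    sel = [p for p, y, a in zip(probs, labels, groups) if y == label and a == group]
    # Assumption: an empty cell contributes a rate of 0.0.
    return sum(sel) / len(sel) if sel else 0.0


def eo_loss(probs, labels, groups):
    """Squared soft TPR gap plus squared soft FPR gap between groups 0 and 1."""
    tpr_gap = soft_rate(probs, labels, groups, 1, 0) - soft_rate(probs, labels, groups, 1, 1)
    fpr_gap = soft_rate(probs, labels, groups, 0, 0) - soft_rate(probs, labels, groups, 0, 1)
    return tpr_gap ** 2 + fpr_gap ** 2


def bce_loss(probs, labels, tiny=1e-12):
    """Binary cross-entropy; `tiny` guards against log(0)."""
    total = sum(y * math.log(p + tiny) + (1 - y) * math.log(1 - p + tiny)
                for p, y in zip(probs, labels))
    return -total / len(labels)


def cec_loss(cec_scores):
    """Mean of squared per-pair CEC scores (each score already in [0, 1])."""
    return sum(s ** 2 for s in cec_scores) / len(cec_scores)


def total_loss(probs, labels, groups, cec_scores, lam_eo=1.0, lam_cec=1.0):
    return (bce_loss(probs, labels)
            + lam_eo * eo_loss(probs, labels, groups)
            + lam_cec * cec_loss(cec_scores))


# Toy minibatch (illustrative values only).
logits = [2.0, -1.0, 0.5, -2.0]
labels = [1, 0, 1, 0]
groups = [0, 0, 1, 1]
cec_scores = [0.1, 0.4, 0.2, 0.3]
probs = [sigmoid(z) for z in logits]
loss = total_loss(probs, labels, groups, cec_scores)
```

Setting `lam_cec=0.0` recovers an equalized-odds-only objective, which makes it easy to see that the EO term never inspects the attribution vectors.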
### Training Algorithm
Algorithm[2](https://arxiv.org/html/2605.12701#alg2) describes the complete procedure. Phase 1 preprocessing (computing baselines and counterfactuals) runs once before training. Phase 2 then trains the model via standard minibatch SGD with the augmented loss.
Algorithm 2: CEC Training

Input: Data $\mathcal{D}$, financial features $\mathcal{F}$, $\lambda_{\text{EO}},\lambda_{\text{CEC}},\eta,E,T$

Output: Trained model $f_{\theta}$

1. // Phase 1: Preprocessing
2. $\{b_{y,a}\}_{y,a\in\{0,1\}}\leftarrow$ label-group means from $\mathcal{D}$
3. $\{\tilde{x}_{i}\}\leftarrow$ Algorithm[1](https://arxiv.org/html/2605.12701#alg1) on $\mathcal{D}$
4. Initialize parameters $\theta$
5. // Phase 2: Training
6. for epoch $=1,\ldots,E$ do
7. for each minibatch $\mathcal{B}=\{(x_{i},y_{i},a_{i},\tilde{x}_{i})\}$ do
8. $\mathcal{L}_{\text{pred}}\leftarrow\text{BCE}(f_{\theta},\mathcal{B})$
9. $\mathcal{L}_{\text{EO}}\leftarrow(\text{TPR}_{0}-\text{TPR}_{1})^{2}+(\text{FPR}_{0}-\text{FPR}_{1})^{2}$
10. $\mathcal{L}_{\text{CEC}}\leftarrow 0$
11. for $(x_{i},\tilde{x}_{i})\in\mathcal{B}$ do
12. $g_{i}\leftarrow\text{IG}(f_{\theta},x_{i},b_{y_{i},a_{i}},T)$
13. $\tilde{g}_{i}\leftarrow\text{IG}(f_{\theta},\tilde{x}_{i},b_{y_{i},a_{i}},T)$ {Same $b$}
14. $g_{i}\leftarrow g_{i}/(\|g_{i}\|+\epsilon)$; $\tilde{g}_{i}\leftarrow\tilde{g}_{i}/(\|\tilde{g}_{i}\|+\epsilon)$
15. $\mathcal{L}_{\text{CEC}}\mathrel{+}=(\|g_{i}-\tilde{g}_{i}\|_{2}\;/\;2)^{2}$
16. end for
17. $\mathcal{L}_{\text{CEC}}\leftarrow\mathcal{L}_{\text{CEC}}/|\mathcal{B}|$
18. $\theta\leftarrow\theta-\eta\,\nabla_{\theta}(\mathcal{L}_{\text{pred}}+\lambda_{\text{EO}}\mathcal{L}_{\text{EO}}+\lambda_{\text{CEC}}\mathcal{L}_{\text{CEC}})$
19. end for
20. end for
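The inner loop of the algorithm relies on an $\text{IG}(f_{\theta},x,b,T)$ routine. A minimal sketch of such a routine is given below, assuming the standard integrated gradients definition $\text{IG}_{k}(x;b)=(x_{k}-b_{k})\int_{0}^{1}\frac{\partial f}{\partial x_{k}}(b+\alpha(x-b))\,d\alpha$ approximated with a $T$-step midpoint Riemann sum. The midpoint rule, the toy model `f`, its weights `W`, and the zero baseline are all assumptions made for illustration; the paper's own quadrature rule and baseline (the label-group mean $b_{y,a}$) may differ.

```python
import math


def integrated_gradients(grad_f, x, baseline, steps=32):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    d = len(x)
    avg_grad = [0.0] * d
    for t in range(steps):
        alpha = (t + 0.5) / steps
        # Point on the straight-line path from the baseline to x.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for k in range(d):
            avg_grad[k] += g[k] / steps
    # Scale the path-averaged gradient by the displacement from the baseline.
    return [(xi - b) * gk for xi, b, gk in zip(x, baseline, avg_grad)]


# Hypothetical toy model with an analytic gradient: f(x) = tanh(w . x).
W = [0.5, -1.0, 0.25]


def f(x):
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))


def grad_f(x):
    s = 1.0 - f(x) ** 2  # derivative of tanh
    return [s * w for w in W]


x = [1.0, 0.5, 2.0]
x_cf = [0.9, 0.6, 2.1]       # stand-in for the matched counterfactual
baseline = [0.0, 0.0, 0.0]   # stand-in for the shared label-group baseline

ig_x = integrated_gradients(grad_f, x, baseline)
ig_cf = integrated_gradients(grad_f, x_cf, baseline)  # same baseline for both inputs

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(ig_x) - (f(x) - f(baseline)))
```

The two resulting vectors would then be normalized and compared as in steps 14 and 15. Using one shared baseline for both inputs keeps the attribution paths comparable, which is the point of the "Same $b$" annotation.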
### Computational Complexity\.
CEC introduces overhead in two phases. In Phase 1, computing label-group baselines costs $O(nd)$, where $n$ is the dataset size and $d$ the feature dimensionality. Counterfactual generation via Algorithm[1](https://arxiv.org/html/2605.12701#alg1) requires building four KD-tree indices at $O(n\log n\cdot d_{\mathcal{F}})$ each, where $d_{\mathcal{F}}=|\mathcal{F}|$, followed by $n$ nearest-neighbor queries at $O(d_{\mathcal{F}}\log n)$ per query, yielding a total preprocessing cost of $O(n\log n\cdot d_{\mathcal{F}})$. In Phase 2, each training batch incurs the standard forward-pass cost of $O(md)$ for the prediction and equalized odds losses, where $m$ is the batch size. The CEC loss dominates: computing integrated gradients for both the factual and counterfactual inputs requires $2mT$ forward passes through the network, where $T$ is the number of integration steps, giving a per-batch CEC cost of $O(mTd)$. The total per-epoch cost is therefore $O\!\left(\frac{n}{m}\cdot mTd\right)=O(nTd)$, that is, the $O(nd)$ cost of standard training multiplied by an overhead factor of $T$. With GPU parallelism across the integration steps and shared backward-pass costs, the practical overhead can be lowered to roughly $3\times$ to $5\times$, rather than the naïve $32\times$.
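As a rough worked count (an added illustration), take the batch size $m=64$ and $T=32$ integration steps reported in the Implementation Details. The CEC term then needs

$$2mT=2\cdot 64\cdot 32=4096$$

network evaluations per minibatch, compared with $m=64$ for the prediction term alone. The constant factor of $2$ (factual plus counterfactual input) is absorbed by the $O(mTd)$ bound, so the asymptotic overhead factor is $T$.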
## Experimental Setting
Our experiments are designed to answer four questions:
- •RQ1: Can models satisfy outcome fairness (equalized odds) while exhibiting hidden procedural bias?
- •RQ2: Does the CEC framework effectively detect and mitigate hidden procedural bias while maintaining outcome fairness and predictive accuracy?
- •RQ3: How do existing fairness interventions perform with respect to procedural fairness metrics?
- •RQ4: Does the effectiveness of CEC generalize across synthetic data, benchmark datasets, and real-world lending data?
### Baselines
We evaluated CEC against six baselines across four datasets of increasing complexity and realism. We compared against representative methods from each fairness paradigm: Unconstrained (standard neural network, no fairness constraint), Disparate Impact Remover (DIR) (Feldman et al. [2015](https://arxiv.org/html/2605.12701#bib.bib116)) for pre-processing, Hardt Post-Processing (Hardt et al. [2016](https://arxiv.org/html/2605.12701#bib.bib28)) for post-processing, Agarwal Reductions (Agarwal et al. [2018](https://arxiv.org/html/2605.12701#bib.bib89)) and Adversarial Debiasing (Zhang et al. [2018](https://arxiv.org/html/2605.12701#bib.bib61)) for in-processing, and Lagrangian Fair Learning (Cotter et al. [2019](https://arxiv.org/html/2605.12701#bib.bib317)) for constrained optimization. All methods target equalized odds constraints for direct comparability with CEC.
### Evaluation Metrics
We evaluated models along three dimensions using complementary metrics\.
- •Utility: F1 score (harmonic mean of precision and recall) and AUC (area under the ROC curve) measure classification quality.
- •Outcome Fairness: Equalized odds gap $=\max(|\text{TPR}_{0}-\text{TPR}_{1}|,|\text{FPR}_{0}-\text{FPR}_{1}|)$ and statistical parity gap $=|P(\hat{Y}=1\mid A=0)-P(\hat{Y}=1\mid A=1)|$ capture group-level outcome disparities.
- •Procedural Fairness: The *CEC score* (Definition[3](https://arxiv.org/html/2605.12701#Thmdefinition3)) measures average explanation consistency. The *Prediction Flip Rate* (PFR) $=P(\hat{y}(x)\neq\hat{y}(\tilde{x}))$ captures individual-level outcome instability. The *regime distribution* reports the fraction of test examples in each of the four regimes (A–D) from Table[1](https://arxiv.org/html/2605.12701#Sx1.T1), with particular emphasis on Regime B (hidden bias).
PFR captures complementary aspects of fairness by detecting outcome\-level instability \(Regimes C and D\)\. A model with low PFR but high CEC is precisely one that exhibits hidden procedural bias, which is the primary target of our framework\.
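The metrics above can be sketched in plain Python as follows. The function names and toy inputs are assumptions made for illustration. In particular, the regime assignment needs a cutoff separating "same reasoning" from "different reasoning"; the value `tau=0.3` below is a hypothetical placeholder, not a threshold taken from the paper, and the reading of Regime C as "different outcome, consistent reasoning" is inferred from the surrounding text.

```python
def rate(preds, labels, groups, label, group):
    """Fraction of positive predictions among samples with the given label and group."""
    sel = [p for p, y, a in zip(preds, labels, groups) if y == label and a == group]
    return sum(sel) / len(sel) if sel else 0.0  # assumption: empty cell -> 0.0


def eo_gap(preds, labels, groups):
    tpr_gap = abs(rate(preds, labels, groups, 1, 0) - rate(preds, labels, groups, 1, 1))
    fpr_gap = abs(rate(preds, labels, groups, 0, 0) - rate(preds, labels, groups, 0, 1))
    return max(tpr_gap, fpr_gap)


def sp_gap(preds, groups):
    def positive_rate(g):
        sel = [p for p, a in zip(preds, groups) if a == g]
        return sum(sel) / len(sel) if sel else 0.0
    return abs(positive_rate(0) - positive_rate(1))


def prediction_flip_rate(preds, preds_cf):
    """Share of pairs whose hard prediction changes between x and its counterfactual."""
    return sum(p != q for p, q in zip(preds, preds_cf)) / len(preds)


def regime(pred, pred_cf, cec, tau=0.3):
    """Assign a regime label; tau is a hypothetical consistency cutoff."""
    same_outcome = pred == pred_cf
    same_reasoning = cec <= tau
    if same_outcome and same_reasoning:
        return "A"  # same outcome, same reasoning
    if same_outcome:
        return "B"  # same outcome, different reasoning (hidden bias)
    if same_reasoning:
        return "C"  # different outcome, consistent reasoning
    return "D"      # different outcome, different reasoning


# Toy data (illustrative only).
preds = [1, 0, 1, 1, 0, 0]
labels = [1, 0, 1, 0, 1, 0]
groups = [0, 0, 0, 1, 1, 1]
preds_cf = [1, 0, 0, 1, 0, 1]
cec = [0.1, 0.5, 0.2, 0.6, 0.1, 0.7]

eo = eo_gap(preds, labels, groups)
sp = sp_gap(preds, groups)
pfr = prediction_flip_rate(preds, preds_cf)
regimes = [regime(p, q, c) for p, q, c in zip(preds, preds_cf, cec)]
```

In this toy example two of six pairs flip, so the PFR is one third, while two further pairs keep the same prediction with a high CEC score and land in Regime B.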
### Datasets
As mentioned, we evaluated our method on four different data sets\.
Synthetic Data. We generated $n=10{,}000$ samples with $d=20$ features partitioned into financial ($|\mathcal{F}|=10$), proxy ($|\mathcal{P}|=5$), and noise ($|\mathcal{N}|=5$) features. Financial features were generated independently of the protected attribute $A$; proxy features were shifted by $0.5\sigma$ between groups to simulate real-world correlations (e.g., residential segregation). Ground-truth labels depended only on financial features: $y=\mathbb{I}\left[\sum_{j\in\mathcal{F}}w_{j}x_{j}+\epsilon>\tau\right]$.
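A generator matching this description can be sketched as below. Several details are not specified in the text and are assumptions here: standard-normal features (so $\sigma=1$), balanced Bernoulli group membership, weights $w_{j}$ drawn once from $N(0,1)$, label noise with standard deviation $0.1$, and threshold $\tau=0$. The function name `make_synthetic` is also hypothetical.

```python
import random


def make_synthetic(n=10_000, n_fin=10, n_proxy=5, n_noise=5, shift=0.5, seed=0):
    """Return a list of (features, label, group) tuples."""
    rng = random.Random(seed)
    # Assumed values: weights ~ N(0, 1), tau = 0, noise std = 0.1.
    weights = [rng.gauss(0, 1) for _ in range(n_fin)]
    tau, noise_std = 0.0, 0.1
    rows = []
    for _ in range(n):
        a = rng.randint(0, 1)                                   # protected attribute
        fin = [rng.gauss(0, 1) for _ in range(n_fin)]           # independent of a
        proxy = [rng.gauss(shift * a, 1) for _ in range(n_proxy)]  # mean shifted by group
        noise = [rng.gauss(0, 1) for _ in range(n_noise)]
        score = sum(w * x for w, x in zip(weights, fin)) + rng.gauss(0, noise_std)
        y = int(score > tau)                                    # label uses financial features only
        rows.append((fin + proxy + noise, y, a))
    return rows


def group_proxy_mean(rows, group, start=10, stop=15):
    """Mean of the proxy columns for one group."""
    vals = [v for x, _, a in rows if a == group for v in x[start:stop]]
    return sum(vals) / len(vals)


data = make_synthetic()
proxy_gap = group_proxy_mean(data, 1) - group_proxy_mean(data, 0)
```

Under these assumptions the empirical proxy gap should sit close to the configured shift of $0.5$, while the label never reads the proxy or noise columns.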
German Credit\(Hofmann[1994](https://arxiv.org/html/2605.12701#bib.bib14)\)\. This data set is a standard fairness benchmark with 1,000 instances, 20 features, and binary creditworthiness labels\. We used gender as the protected attribute and definedℱ\\mathcal\{F\}to include credit amount, duration, installment rate, present residence, age, number of existing credits, and number of dependents\. Features encoding gender \(personal status\) and immigration status were excluded fromℱ\\mathcal\{F\}\.
Adult Income (Becker and Kohavi [1996](https://arxiv.org/html/2605.12701#bib.bib91)). Derived from the 1994 U.S. Census, this dataset contains 48,842 instances with demographic and employment features, and a binary label indicating whether annual income exceeds \$50K. We used gender as the protected attribute. Financial features included capital-gain, capital-loss, hours-per-week, and occupation category. Education level and marital status were excluded from $\mathcal{F}$ as potential demographic proxies.
HMDA Mortgage Data\(Consumer Financial Protection Bureau[2024](https://arxiv.org/html/2605.12701#bib.bib18)\)\. We used 2024 Home Mortgage Disclosure Act data from the Consumer Financial Protection Bureau, which provides real\-world mortgage lending records with regulatory\-grade detail\. The protected attribute was race \(White vs\. non\-White\)\. Financial features included loan amount, income, debt\-to\-income ratio, property value, loan\-to\-value ratio, and loan term\. We excluded census tract, county code, and applicant ethnicity fromℱ\\mathcal\{F\}\. This dataset provides the most realistic evaluation, reflecting actual lending patterns subject to federal fair lending oversight\.
### Implementation Details
All methods used neural networks with the same architecture, which was tuned to two hidden layers of sizes $[128,64]$ with ReLU activations and dropout rate $0.2$. We trained for 30 epochs using the Adam optimizer with learning rate $3\times 10^{-4}$ and batch size 64. For CEC, we set $\lambda_{\text{EO}}=1.0$, $\lambda_{\text{CEC}}=1.0$, and $T=32$ IG integration steps. We performed 5-fold cross-validation and report means and standard deviations across folds.
Table 2: Performance comparison across four datasets. Results are mean $\pm$ std. $\downarrow$ indicates lower is better (EO, SP, CEC), and $\uparrow$ indicates higher is better (AUC, F1). Bold indicates best performance.
## Experimental Results
The results in Table[2](https://arxiv.org/html/2605.12701#Sx4.T2) show a clear disconnect between outcome fairness and procedural fairness across all four datasets. Several baseline methods, such as Adversarial Debiasing and Agarwal Reductions, achieved competitive or even superior equalized odds and statistical parity scores, substantially reducing outcome disparities on multiple datasets. However, these same methods exhibited CEC scores that are comparable to, or in some cases *worse than*, the unconstrained baseline. This confirms the motivation of our work, which is that optimizing for outcome fairness does not address hidden procedural bias. The CEC-trained model is the only method that consistently achieved the lowest CEC scores across all four datasets, sometimes by a wide margin, while simultaneously delivering competitive equalized odds and statistical parity performance. The CEC-trained model did so with only modest reductions in predictive utility, maintaining F1 scores within a few percentage points of the best-performing baselines on German Credit and Adult Income. On Synthetic and HMDA, the higher variance in CEC’s utility metrics reflects the sensitivity of the multi-objective loss at the chosen $\lambda$ configuration.
The magnitude of CEC’s improvement varied meaningfully across datasets, and these differences reflect the underlying structure of procedural bias in each dataset. On Synthetic data, where the ground-truth labeling function uses only financial features and the procedural bias is embedded synthetically, all methods started with relatively low CEC scores, and the gap between the CEC model and baselines was correspondingly modest. This is expected because the synthetic design isolates procedural bias in a controlled environment rather than in the trained models themselves. The picture changed dramatically on German Credit, a dataset collected in 1990s Germany when gender-differentiated lending practices were more prevalent and structurally embedded. In this dataset, all baseline methods exhibited CEC scores above 0.53, while the CEC model reduced this to 0.21. The historical context matters because features like employment tenure carry a gender-correlated signal from an era of systematic labor market segregation, creating precisely the kind of hidden procedural pathways that outcome-based methods cannot detect. Adult Income showed a similar pattern, with the CEC model achieving an even larger reduction from 0.60 to 0.08, reflecting the well-documented structural entanglement of race and gender with income-determining features in U.S. labor data in the 1990s. On HMDA, which represents recent mortgage lending under active ECOA oversight, baseline CEC scores clustered around 0.36–0.40, and the CEC model reduced this to 0.26. The smaller relative improvement is expected because modern regulatory pressure has likely reduced the most overt forms of procedural bias in mortgage underwriting, yet a meaningful procedural gap persisted in these experiments.
### PFR\-CEC Space


Figure 4: Fairness regime plots for German Credit (left) and Synthetic (right). Each method is positioned by its Prediction Flip Rate (PFR) and CEC score. The CEC model is the only method consistently in Quadrant A on both datasets.

Figure[4](https://arxiv.org/html/2605.12701#Sx5.F4) visualizes the fairness regime landscape for the German Credit and Synthetic datasets by plotting each method in PFR–CEC space, where the four quadrants correspond to the regimes in Table[1](https://arxiv.org/html/2605.12701#Sx1.T1). On German Credit, the methods are split into several clusters. DIR, Hardt, and Agarwal achieve low prediction flip rates but exhibit high CEC, placing them in Quadrant B, which means they equalized predictions across counterfactual pairs while relying on fundamentally different reasoning, demonstrating the hidden procedural bias that motivates our work. Unconstrained, Lagrangian, and Adversarial showed worse results. Their high PFR combined with high CEC positions them in Quadrant D, where both outcome and procedural fairness are violated. The CEC model was the only method that reached Quadrant A, achieving low scores on both axes simultaneously.
The Synthetic dataset shows a slightly different result. Here, the Unconstrained, Lagrangian, and Adversarial methods sit near the boundary between Quadrants A and B, reflecting the moderate procedural bias embedded by design. Meanwhile, DIR, Hardt, and Agarwal are pushed into Quadrant B with higher CEC, suggesting that their fairness interventions inadvertently *increased* explanation instability even as they equalized predictions. The CEC model occupies the lowest CEC position. Taken together, the two plots illustrate that Quadrant B is a blind spot that most previous fairness algorithms fail to address, and the CEC model is the only method among those compared that systematically moves models toward Quadrant A.
### Regime Distribution


Figure 5: Regime distribution for German Credit (left) and Synthetic (right). For each method, the plot shows the percentage of its samples that fall into each of the four taxonomy regimes.

Figure[5](https://arxiv.org/html/2605.12701#Sx5.F5) provides a complementary per-sample view by showing the fraction of test individuals assigned to each fairness regime. On German Credit, the dominance of Regime B is striking for the DIR, Hardt, and Agarwal methods, with approximately 90% of all individuals in the hidden bias regime, meaning that for nearly every applicant, the model produced the same outcome for the individual and the counterfactual counterpart but arrived at it through group-dependent reasoning. Even the Unconstrained model assigned around 69% of samples to Regime B, with an additional 24% in Regime D (both unfair). The CEC model assigned nearly 80% of samples to Regime A (fully fair) while reducing Regime B to under 19%.
On Synthetic data, the pattern is similar but less extreme. Most baselines placed 39–99% of samples in Regime B, depending on the method, while the CEC model reduced this to approximately 30% and raised Regime A to 66%. A notable observation is that outcome-focused methods like DIR, Hardt, and Agarwal actually *worsened* the regime distribution compared to the Unconstrained baseline. This happened because these methods force predictions to agree across groups without constraining explanations, thereby converting Regime D samples (which are mostly detectable by outcome metrics) into Regime B samples (which are not detectable by outcome metrics). This conversion effect is why outcome-oriented fairness methods cannot mitigate procedural bias.
### Trade\-Off Analysis
Table 3: Pareto non-dominance across 5 folds $\times$ 4 datasets. We note that the CEC model has a structural advantage over baselines due to its multi-objective nature.

To assess whether procedural fairness comes at the cost of utility or outcome fairness, we compute Pareto non-dominance across all 5 folds $\times$ 4 datasets. A method is non-dominated on a given fold if no other method achieves strictly better F1, EO gap, and CEC score simultaneously. Table[3](https://arxiv.org/html/2605.12701#Sx5.T3) reports the results. The CEC model is the only method that produced a non-dominated solution on every fold of every dataset (20/20), meaning that no baseline ever jointly outperformed the CEC model across all three objectives. The next-best methods were Unconstrained (13/20), Agarwal (12/20), and Adversarial (12/20), which were frequently dominated because their strong utility came paired with poor CEC scores that the CEC model improved upon without sacrificing the other dimensions. DIR and Hardt were dominated on nearly every fold (0/20 and 1/20 non-dominated, respectively), reflecting their tendency to harm both utility and procedural fairness because they do not mitigate bias within the model itself. The result was especially notable on German Credit, where the CEC model was non-dominated on all 5 folds, while Unconstrained, Lagrangian, and Hardt were each non-dominated on at most 1 fold. This confirms that on datasets with substantial hidden procedural bias, the CEC model achieved trade-offs that no other method could match.
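The non-dominance criterion stated above can be sketched as follows. The method names and numbers are made up for illustration and are not taken from Table 2 or Table 3; the helper names `dominates` and `non_dominated` are likewise assumptions. The check follows the wording in the text: a method is dominated only if some other method is strictly better on all three of F1 (higher is better), EO gap, and CEC score (lower is better).

```python
def dominates(a, b):
    """True if a is strictly better than b on F1 (higher), EO gap and CEC (lower)."""
    return a["f1"] > b["f1"] and a["eo"] < b["eo"] and a["cec"] < b["cec"]


def non_dominated(results):
    """Names of methods that no other method dominates on this fold."""
    return [
        name
        for name, r in results.items()
        if not any(dominates(o, r) for other, o in results.items() if other != name)
    ]


# Hypothetical results for a single fold (illustrative values only).
fold = {
    "method_1": {"f1": 0.80, "eo": 0.10, "cec": 0.50},
    "method_2": {"f1": 0.78, "eo": 0.12, "cec": 0.55},  # worse than method_1 on all three
    "method_3": {"f1": 0.76, "eo": 0.05, "cec": 0.20},
}
winners = non_dominated(fold)
```

Here `method_3` survives despite the lowest F1 because no other method beats it on EO gap and CEC at the same time; counting such survivals over all folds and datasets gives the per-method tallies reported in the table.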
## Ablation Study
### Hyperparameter Sensitivity
Figure 6: Sensitivity plots of $\lambda_{\text{EO}}$ and $\lambda_{\text{CEC}}$ on the German Credit dataset.

Figure[6](https://arxiv.org/html/2605.12701#Sx6.F6) presents a hyperparameter sensitivity analysis on German Credit, sweeping $\lambda_{\text{CEC}}$ and $\lambda_{\text{EO}}$ over $\{0.1,0.5,1.0,2.0,5.0,10.0\}$ across six metrics. The heatmaps reveal that CEC responded primarily to $\lambda_{\text{CEC}}$: it dropped sharply between $\lambda_{\text{CEC}}=0.1$ and $\lambda_{\text{CEC}}=0.5$ and plateaued thereafter, indicating that even modest procedural fairness pressure yielded substantial gains. Regime B followed the same pattern, falling from over 50% at $\lambda_{\text{CEC}}=0.1$ to approximately 20% for $\lambda_{\text{CEC}}\geq 1.0$, largely independent of $\lambda_{\text{EO}}$, with Regime A moving in the opposite direction. F1 was remarkably stable across the moderate regularization region ($\lambda_{\text{CEC}}\leq 1.0$), remaining above 0.82 regardless of $\lambda_{\text{EO}}$, with meaningful degradation occurring only at $\lambda_{\text{CEC}}\geq 2.0$. The EO gap exhibited more complex behavior: it was reliably near zero in the low-to-moderate $\lambda_{\text{CEC}}$ region ($\leq 1.0$) but became erratic at higher values, where aggressive CEC regularization pushed the model toward near-constant predictions that destabilized group-wise rate estimates. PFR remained low throughout, confirming that prediction stability was preserved even under strong regularization. Taken together, the heatmaps identify a favorable operating region at $\lambda_{\text{CEC}}\in[0.5,1.0]$ and $\lambda_{\text{EO}}\leq 2.0$ where procedural fairness improved substantially, outcome fairness was maintained, and utility cost was negligible.
This sensitivity analysis confirms that our default configuration ($\lambda_{\text{CEC}}=1.0$, $\lambda_{\text{EO}}=1.0$) was well-situated within this favorable region.
### Effect of Composite Loss Function
Table 4: Performance of different parts of the loss function.

To isolate the contribution of each loss component, Table[4](https://arxiv.org/html/2605.12701#Sx6.T4) reports results for four training variants on German Credit: prediction loss only, prediction with equalized odds, prediction with CEC, and the full objective. The most striking finding is that adding EO regularization alone (Pred + EO) reduced neither the CEC score nor the Regime B fraction compared to the prediction-only baseline. In fact, Regime B slightly *increased* from 70.5% to 71.4%, confirming that outcome fairness constraints are orthogonal to procedural fairness and can even marginally exacerbate hidden bias. Conversely, adding CEC regularization alone (Pred + CEC) dramatically reduced the CEC score from 0.578 to 0.233 and Regime B from 70.5% to 21.0%, while also driving the EO gap to zero as a side effect on German Credit data, suggesting that enforcing consistent reasoning across groups naturally promotes outcome equity even without explicit EO constraints. The full model (Pred + EO + CEC) achieved the best F1 while maintaining strong CEC and EO performance, with a slight increase in EO gap relative to the CEC-only variant as the three objectives negotiated their trade-off. These results demonstrate that CEC is the essential component for procedural fairness and that its benefits are largely complementary to, rather than redundant with, outcome fairness regularization.
Figure 7: Regime distribution of each part of the loss function.

Figure[7](https://arxiv.org/html/2605.12701#Sx6.F7) visualizes these transitions as stacked regime distributions. The figure shows that Pred + EO could not mitigate hidden procedural bias. In contrast, under Pred + CEC the dominant Regime B mass is shifted into Regime A. The full model retained nearly all of CEC’s regime improvements while the EO component provided a small increase in Regime C and D samples.
## Discussion
Our results provide clear answers to each research question and also expose an important blind spot in the current algorithmic fairness paradigm. RQ1 asked whether outcome-fair models can harbor hidden procedural bias. The answer was shown to be yes across all four datasets; every baseline method that achieved competitive equalized odds scores simultaneously exhibited CEC scores comparable to or worse than the unconstrained model. The regime distribution analysis makes this explicit on German Credit, where DIR, Hardt, and Agarwal place nearly 90% of individuals in Regime B, meaning that for the vast majority of applicants, the model produced an outcome-fair prediction but arrived at it through group-dependent reasoning. The ablation study sharpens this finding further, as adding EO regularization alone did not reduce Regime B at all. In fact, the Regime B fraction rose slightly (from 70.5% to 71.4%). This demonstrates that outcome fairness and procedural fairness are orthogonal and that optimizing for one does not address the other.
RQ2 and RQ3 are best answered jointly. The CEC model was the only method that consistently achieved the lowest CEC scores and Regime B fractions across all datasets, while simultaneously delivering competitive equalized odds and statistical parity performance. The ablation shows that CEC regularization alone drove the EO gap to zero as a side effect, suggesting that enforcing consistent reasoning across groups naturally promotes outcome equity, but the reverse does not hold. Existing fairness interventions, whether they operate through data transformation (DIR), threshold adjustment (Hardt), constrained optimization (Agarwal, Lagrangian), or adversarial training, all failed to reduce procedural bias because they optimize over predictions without attending to the model’s internal explanation structure. The sensitivity analysis confirms that these gains are robust, as CEC and Regime B improve sharply at moderate $\lambda_{\text{CEC}}$ values while F1 remains above 0.82, and the favorable operating region is broad rather than a narrow sweet spot.
RQ4 asked whether CEC generalizes across data settings of varying complexity and realism\. The progression from synthetic data \(controlled bias, known ground truth\) through benchmark datasets \(German Credit from 1990s Germany, Adult Income from U\.S\. census data\) to real\-world mortgage lending \(HMDA under active ECOA oversight\) demonstrates that the CEC model’s effectiveness is not an artifact of any single domain\. Notably, the magnitude of improvement varies in ways that are consistent with the structural properties of each dataset: German Credit and Adult Income, where historical patterns of gender and racial segregation are deeply embedded in feature distributions, showed the largest CEC reductions, while HMDA, where modern regulatory pressure has already reduced the most overt forms of procedural bias, showed a smaller but still meaningful improvement\. These findings suggest that procedural bias is a pervasive feature of real lending systems, that it persists even under regulatory oversight, and that it only becomes visible when the right diagnostic tools are applied\.
Our findings carry direct consequences for fair lending compliance\. Under ECOA and Regulation B, lenders are prohibited from applying different underwriting standards to different demographic groups, and this prohibition maps precisely onto Regime B in our taxonomy\. Current model risk management practices, including those outlined in the U\.S\. Federal Reserve’s SR 11\-7 guidance, evaluate models primarily through outcome\-based disparate impact testing\. Our results demonstrate that this approach may be insufficient, because a model can pass every standard disparate impact test while systematically applying different reasoning to applicants from different groups\. This gap is particularly concerning as financial institutions increasingly adopt complex neural network models that are harder to audit through traditional means\. CEC offers a practical tool to close this gap\. At the model development stage, CEC regularization can be incorporated into training pipelines to prevent procedural bias from arising in the first place\. At the audit stage, the CEC score and regime distribution provide quantitative evidence of whether a model’s reasoning is consistent across groups\. The per\-individual nature of CEC is especially relevant: rather than reporting only aggregate group statistics, it can identify specific applicants who were subjected to inconsistent reasoning, enabling targeted remediation\. As regulators worldwide move toward requiring explainability in automated credit decisions, frameworks that audit not just what models decide but how they reason will become increasingly essential\.
## Conclusion
We introduced Counterfactual Explanation Consistency \(CEC\), a framework for detecting and mitigating hidden procedural bias in automated lending models\. Our work is motivated by a fundamental gap in the algorithmic fairness literature: existing methods ensure that models produce equitable outcomes across demographic groups but do not examine whether models arrive at those outcomes through consistent reasoning\. We formalized this gap through a four\-regime taxonomy and showed, both theoretically and empirically, that outcome fairness and procedural fairness are orthogonal, and that optimizing for one can actively undermine the other\.
CEC addresses this gap by proposing nearest\-neighbor counterfactual generation that produces realistic matched pairs without causal assumptions, a consistent baseline principle that isolates discriminatory reasoning from reference\-point artifacts in integrated gradient comparisons, and a differentiable training loss that encourages explanation consistency alongside accuracy and outcome fairness\. Our experiments across synthetic data, German Credit, Adult Income, and HMDA mortgage data demonstrated that all six baseline fairness methods leave the majority of individuals in Regime B \(same prediction, different reasoning\), while CEC shifts the dominant regime to Regime A \(fully fair\) with minimal utility cost\. The ablation study reveals that CEC regularization alone drove the equalized odds gap to near zero as a side effect, while the reverse did not hold, suggesting that procedural fairness may be the more fundamental objective from which outcome fairness follows naturally\.
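The comparison summarized above \(match an individual to a nearest neighbor in the other group, then compare integrated\-gradient attributions computed against the same baseline\) can be sketched as follows\. This is a minimal illustration under stated assumptions, not the paper’s implementation: the logistic scorer, the zero baseline, the L1 gap, and all function names are hypothetical, and label stratification of the match is omitted for brevity\.

```python
import math


def predict(weights, bias, x):
    """Toy logistic scorer standing in for the credit model (assumption)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def gradient(weights, bias, x):
    """Analytic gradient of the logistic score with respect to the input."""
    p = predict(weights, bias, x)
    return [p * (1.0 - p) * w for w in weights]


def integrated_gradients(weights, bias, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation along the straight-line path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = gradient(weights, bias, point)
        total = [t + g for t, g in zip(total, grad)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def nearest_counterfactual(x, candidates, n_financial):
    """Closest other-group record, matched on the first n_financial features."""
    return min(candidates, key=lambda c: math.dist(x[:n_financial], c[:n_financial]))


def attribution_gap(a, b):
    """L1 distance between attribution vectors (an assumed choice of distance)."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


# Hypothetical features: [income, debt, group indicator]
weights, bias = [1.0, -1.0, 0.8], 0.0
baseline = [0.0, 0.0, 0.0]  # the same baseline is used for both attributions
applicant = [0.9, 0.4, 1.0]
other_group = [[0.2, 0.9, 0.0], [0.85, 0.45, 0.0]]

match = nearest_counterfactual(applicant, other_group, n_financial=2)
gap = attribution_gap(
    integrated_gradients(weights, bias, applicant, baseline),
    integrated_gradients(weights, bias, match, baseline),
)
```

Using one baseline for both members of the pair keeps the gap from reflecting a change of reference point, which is the motivation the text gives for the consistent baseline principle\.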
## Limitations and Future Work
Limitations\. Our framework has several limitations that warrant acknowledgment\. First, CEC relies on integrated gradients as the attribution method, which assumes that the straight\-line path from baseline to input traverses a meaningful region of the model’s decision space\. Alternative attribution methods, such as SHAP or attention\-based explanations, may yield different consistency assessments\. However, we leave the comparative study of attribution methods within the CEC framework to future work\. Second, the financial feature set $\mathcal{F}$ must be specified by domain experts, and different choices of $\mathcal{F}$ may lead to different counterfactual matches and CEC scores\. While this is inherent to any fairness method that distinguishes legitimate from illegitimate features, it introduces a degree of subjectivity that practitioners must navigate\. Third, our experiments use binary protected attributes; extending CEC to multi\-valued or intersectional attributes would require addressing combinatorial growth in the number of counterfactual comparisons and label\-group baselines\. Fourth, the computational overhead of integrated gradients \(approximately $3\times$ to $5\times$ standard training time\) may limit applicability to very large\-scale models, though this cost is incurred only during training and not at inference time\.
Future Work\. Several directions merit further investigation\. CEC scores can be decomposed at the feature level to identify which specific features exhibit the largest attribution disparities across groups, enabling targeted model debugging and more interpretable fairness audits\. The relationship between CEC and causal notions of fairness deserves formal analysis: while our nearest\-neighbor matching avoids causal assumptions, understanding when and whether CEC scores align with causal path\-specific effects would strengthen the theoretical foundations\. Extending the framework to continuous or multi\-valued protected attributes, regression settings, and non\-tabular domains such as text or image\-based lending decisions would also broaden its applicability\.
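One way the feature\-level decomposition mentioned above might look is sketched below\. It assumes attribution vectors are already available for each individual and their counterfactual counterpart, and it ranks features by mean absolute attribution difference\. The function name, the feature names, and the choice of mean absolute difference are illustrative assumptions rather than the paper’s definition\.

```python
def feature_level_gaps(attribution_pairs, feature_names):
    """Rank features by mean absolute attribution difference across pairs.

    attribution_pairs: list of (attributions, counterfactual_attributions),
    each a list of floats aligned with feature_names.
    Returns (feature, mean_gap) tuples sorted from largest to smallest gap.
    """
    totals = [0.0] * len(feature_names)
    for own, counterfactual in attribution_pairs:
        for j, (a, b) in enumerate(zip(own, counterfactual)):
            totals[j] += abs(a - b)
    n = len(attribution_pairs)
    mean_gaps = [(name, total / n) for name, total in zip(feature_names, totals)]
    return sorted(mean_gaps, key=lambda item: item[1], reverse=True)


# Hypothetical attributions for two matched pairs over two features
pairs = [
    ([0.2, 0.1], [0.2, 0.4]),
    ([0.3, 0.0], [0.1, 0.4]),
]
ranking = feature_level_gaps(pairs, ["income", "duration"])
```

In this toy input the second feature carries most of the disparity, so it would be the first candidate for inspection in an audit\.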
## References
- E\. C\. O\. Act \(2018\) Equal credit opportunity act\. Women in the American Political System: An Encyclopedia of Women as Voters, Candidates, and Office Holders 2, pp\. 129\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p1.1), [Introduction](https://arxiv.org/html/2605.12701#Sx1.p3.1)\.
- F\. H\. Act \(1968\) Fair housing act\. Home Mortgage Disclosure Act, and Community\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p1.1)\.
- A\. Agarwal, A\. Beygelzimer, M\. Dudík, J\. Langford, and H\. Wallach \(2018\) A reductions approach to fair classification\. In Proceedings of the 35th International Conference on Machine Learning \(ICML\), pp\. 60–69\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1), [Baselines](https://arxiv.org/html/2605.12701#Sx4.SSx1.p1.1)\.
- C\. Agarwal, S\. Krishna, E\. Saxena, M\. Pawelczyk, N\. Johnson, I\. Puri, M\. Zitnik, and H\. Lakkaraju \(2022\) OpenXAI: towards a transparent evaluation of model explanations\. Advances in Neural Information Processing Systems 35, pp\. 15784–15799\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1)\.
- B\. Becker and R\. Kohavi \(1996\) Adult\. Note: UCI Machine Learning Repository\. DOI: https://doi\.org/10\.24432/C5XW20\. Cited by: [Datasets](https://arxiv.org/html/2605.12701#Sx4.SSx3.p4.2)\.
- T\. Begley, T\. Schwedes, C\. Frye, and I\. Feige \(2020\) Explainability for fair machine learning\. arXiv preprint arXiv:2010\.07389\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1)\.
- O\. A\. Bello \(2023\) Machine learning algorithms for credit risk assessment: an economic and financial analysis\. International Journal of Management 10\(1\), pp\. 109–133\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p1.1)\.
- S\. Chiappa \(2019\) Path\-specific counterfactual fairness\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 33, pp\. 7801–7808\. Cited by: [Counterfactual Fairness and Reasoning](https://arxiv.org/html/2605.12701#Sx2.SSx4.p1.1)\.
- A\. Chouldechova \(2017\) Fair prediction with disparate impact: a study of bias in recidivism prediction instruments\. Big Data 5\(2\), pp\. 153–163\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p1.1)\.
- Consumer Financial Protection Bureau \(2024\) HMDA data\. Note: https://ffiec\.cfpb\.gov/data\-browser/data/2024?category=states&items=CA Cited by: [Datasets](https://arxiv.org/html/2605.12701#Sx4.SSx3.p5.1)\.
- A\. Cotter, H\. Jiang, M\. Gupta, S\. Wang, T\. Narayan, S\. You, and K\. Sridharan \(2019\) Optimization with non\-differentiable constraints with applications to fairness, recall, churn, and other goals\. Journal of Machine Learning Research 20\(172\), pp\. 1–59\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1), [Multi\-Objective Fair Learning](https://arxiv.org/html/2605.12701#Sx2.SSx5.p1.1), [Baselines](https://arxiv.org/html/2605.12701#Sx4.SSx1.p1.1)\.
- J\. Dai, S\. Upadhyay, U\. Aivodji, S\. H\. Bach, and H\. Lakkaraju \(2022\) Fairness via explanation quality: evaluating disparities in the quality of post hoc explanations\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp\. 203–214\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1)\.
- X\. Dastile, T\. Celik, and M\. Potsane \(2020\) Statistical and machine learning models in credit scoring: a systematic literature survey\. Applied Soft Computing 91, pp\. 106263\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p1.1)\.
- C\. Dwork, M\. Hardt, T\. Pitassi, O\. Reingold, and R\. Zemel \(2012\) Fairness through awareness\. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pp\. 214–226\. Cited by: [Procedural Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx3.p1.1)\.
- M\. Feldman, S\. A\. Friedler, J\. Moeller, C\. Scheidegger, and S\. Venkatasubramanian \(2015\) Certifying and removing disparate impact\. In Proceedings of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 259–268\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p2.1), [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p1.1), [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1), [Baselines](https://arxiv.org/html/2605.12701#Sx4.SSx1.p1.1)\.
- J\. Germino, Y\. Zhao, T\. Derr, N\. Moniz, and N\. V\. Chawla \(2025\) Explanation difference: bridging procedural and distributional fairness\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, Vol\. 8, pp\. 1078–1090\. Cited by: [Procedural Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx3.p1.1)\.
- N\. Grgić\-Hlača, M\. B\. Zafar, K\. P\. Gummadi, and A\. Weller \(2018\) Beyond distributive fairness in algorithmic decision making: feature selection for procedurally fair learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 32\. Cited by: [Procedural Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx3.p1.1)\.
- M\. Hardt, E\. Price, and N\. Srebro \(2016\) Equality of opportunity in supervised learning\. Advances in Neural Information Processing Systems 29\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p2.1), [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p1.1), [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1), [Baselines](https://arxiv.org/html/2605.12701#Sx4.SSx1.p1.1)\.
- H\. Hofmann \(1994\) Statlog \(German Credit Data\)\. Note: UCI Machine Learning Repository\. DOI: https://doi\.org/10\.24432/C5NC77\. Cited by: [Datasets](https://arxiv.org/html/2605.12701#Sx4.SSx3.p3.2)\.
- F\. Kamiran and T\. Calders \(2012\) Data preprocessing techniques for classification without discrimination\. Knowledge and Information Systems 33\(1\), pp\. 1–33\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1)\.
- J\. Kleinberg, S\. Mullainathan, and M\. Raghavan \(2017\) Inherent trade\-offs in the fair determination of risk scores\. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference \(ITCS\)\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p1.1)\.
- M\. J\. Kusner, J\. Loftus, C\. Russell, and R\. Silva \(2017\) Counterfactual fairness\. Advances in Neural Information Processing Systems 30\. Cited by: [Counterfactual Fairness and Reasoning](https://arxiv.org/html/2605.12701#Sx2.SSx4.p1.1)\.
- G\. S\. Leventhal \(1980\) What should be done with equity theory? New approaches to the study of fairness in social relationships\. In Social Exchange: Advances in Theory and Research, pp\. 27–55\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p3.1), [Procedural Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx3.p1.1)\.
- S\. M\. Lundberg and S\. Lee \(2017\) A unified approach to interpreting model predictions\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 30\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1)\.
- R\. K\. Mothilal, A\. Sharma, and C\. Tan \(2020\) Explaining machine learning classifiers through diverse counterfactual explanations\. In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency \(FAccT\), pp\. 607–617\. Cited by: [Counterfactual Fairness and Reasoning](https://arxiv.org/html/2605.12701#Sx2.SSx4.p2.1)\.
- R\. Nagpal, R\. Shahsavarifar, V\. Goyal, and A\. Gupta \(2025\) Optimizing fairness and accuracy: a pareto optimal approach for decision\-making\. AI and Ethics 5\(2\), pp\. 1743–1756\. Cited by: [Multi\-Objective Fair Learning](https://arxiv.org/html/2605.12701#Sx2.SSx5.p1.1)\.
- M\. Pawelczyk, S\. Bielawski, J\. v\. d\. Heuvel, T\. Richter, and G\. Kasneci \(2021\) CARLA: a python library to benchmark algorithmic recourse and counterfactual explanation algorithms\. arXiv preprint arXiv:2108\.00783\. Cited by: [Counterfactual Fairness and Reasoning](https://arxiv.org/html/2605.12701#Sx2.SSx4.p2.1)\.
- G\. Popoola and J\. Sheppard \(2024\) Investigating and mitigating the performance–fairness tradeoff via protected\-category sampling\. Electronics 13\(15\), pp\. 3024\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1)\.
- R\. Poyiadzi, K\. Sokol, R\. Santos\-Rodriguez, T\. De Bie, and P\. Flach \(2020\) FACE: feasible and actionable counterfactual explanations\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society \(AIES\), pp\. 344–350\. Cited by: [Counterfactual Fairness and Reasoning](https://arxiv.org/html/2605.12701#Sx2.SSx4.p2.1)\.
- J\. Rawls \(1971\) A theory of justice\. Harvard University Press\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p3.1)\.
- M\. T\. Ribeiro, S\. Singh, and C\. Guestrin \(2016\) “Why should I trust you?” Explaining the predictions of any classifier\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp\. 1135–1144\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1)\.
- P\. Schwab, L\. Linhardt, and W\. Karlen \(2018\) Perfect match: a simple method for learning representations for counterfactual inference with neural networks\. arXiv preprint arXiv:1810\.00656\. Cited by: [Label\-Stratified Nearest\-Neighbor Matching\.](https://arxiv.org/html/2605.12701#Sx3.SSx2.SSSx2.p2.1)\.
- D\. Slack, S\. Hilgard, E\. Jia, S\. Singh, and H\. Lakkaraju \(2020\) Fooling LIME and SHAP: adversarial attacks on post hoc explanation methods\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp\. 180–186\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1)\.
- M\. Sundararajan, A\. Taly, and Q\. Yan \(2017\) Axiomatic attribution for deep networks\. In Proceedings of the 34th International Conference on Machine Learning \(ICML\), pp\. 3319–3328\. Cited by: [Explainability and Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx2.p1.1), [Integrated Gradients Background\.](https://arxiv.org/html/2605.12701#Sx3.SSx3.SSSx1.p1.2)\.
- J\. Thibaut, L\. Walker, S\. LaTour, and P\. Houlden \(1973\) Procedural justice as fairness\. Stanford Law Review 26, pp\. 1271\. Cited by: [Introduction](https://arxiv.org/html/2605.12701#Sx1.p3.1)\.
- Z\. Wang, Q\. Zeng, W\. Lin, M\. Jiang, and K\. C\. Tan \(2024\) Generating diagnostic and actionable explanations for fair graph neural networks\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 21690–21698\. Cited by: [Multi\-Objective Fair Learning](https://arxiv.org/html/2605.12701#Sx2.SSx5.p1.1)\.
- C\. Wei and M\. Niethammer \(2022\) Fairness and accuracy under domain generalization\. In Proceedings of the 10th International Conference on Learning Representations \(ICLR\)\. Cited by: [Multi\-Objective Fair Learning](https://arxiv.org/html/2605.12701#Sx2.SSx5.p1.1)\.
- Y\. Wu, L\. Zhang, and X\. Wu \(2019\) Counterfactual fairness: unidentification, bound and algorithm\. In Proceedings of the 28th International Joint Conference on Artificial Intelligence\. Cited by: [Counterfactual Fairness and Reasoning](https://arxiv.org/html/2605.12701#Sx2.SSx4.p1.1)\.
- B\. H\. Zhang, B\. Lemoine, and M\. Mitchell \(2018\) Mitigating unwanted biases with adversarial learning\. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp\. 335–340\. Cited by: [Outcome Fairness in Machine Learning](https://arxiv.org/html/2605.12701#Sx2.SSx1.p2.1), [Baselines](https://arxiv.org/html/2605.12701#Sx4.SSx1.p1.1)\.
- T\. Zhao, A\. Wang, and T\. Derr \(2023\) Fairness and explainability: bridging the gap towards fair model explanations\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 37, pp\. 14919–14928\. Cited by: [Procedural Fairness](https://arxiv.org/html/2605.12701#Sx2.SSx3.p1.1)\.