# Fair and Calibrated Toxicity Detection with Robust Training and Abstention
Source: [https://arxiv.org/html/2605.14074](https://arxiv.org/html/2605.14074)
###### Abstract
Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ($n=1000$).
We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration (ECE 0.013) but is significantly miscalibrated on every identity subgroup (calibration-fairness gaps of +0.029 to +0.134). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC +0.06 to +0.12) but worsens the calibration-fairness gap, which reaches +0.232. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE 0.118). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods equivalent on aggregate ranking differ sharply in failure modes that determine real-world harm.
## 1 Problem and Goal
Problem Statement. Toxicity classifiers often learn spurious correlations between identity mentions and toxicity labels. Because hate speech frequently co-occurs with mentions of protected groups, models treat identity terms as signals for toxicity. This causes neutral or anti-racist content to be disproportionately penalized, producing systematic false positives. At a fixed probability threshold, these correlations become disparate moderation outcomes.
Integrated Fairness Axes. Standard evaluations focus on ranking metrics (BPSN/BNSP AUC) and treat calibration or abstention as separate quality concerns. We argue this separation is misleading for three reasons. First, calibration disparity is a fairness violation: if confidence corresponds to different accuracies across groups, threshold-based decisions are inherently biased. Second, abstention efficacy is a fairness property: confidence-based deferral only works if confidence tracks correctness uniformly. Third, post-hoc interventions inherit failure modes from training. Temperature scaling, threshold optimization, and abstention succeed or fail based on how the training method shaped the probability distribution.
Mapping to SRAI Principles. The three axes correspond to core Socially Responsible AI functions. Protection from disparate flagging requires fair ranking and fair calibration. Information (uncertainty reporting) requires per-subgroup calibration parity so that a confidence score remains meaningful across all content. Prevention and Mitigation through deferral requires that abstention works uniformly: if deferral is effective only on background content, the safety net is unfair.
Contributions.
1. Multi-Method Comparison. We evaluate ERM, Reweighted ERM, and Group DRO combined with three post-hoc interventions, using paired bootstrap 95% CIs ($n=1000$) for all metrics.
2. Calibration as Fairness. We introduce the calibration-fairness gap (subgroup ECE minus background ECE). We show ERM is significantly miscalibrated on every identity subgroup despite near-perfect aggregate calibration.
3. Non-Dominated Trade-offs. We characterize training methods as occupying distinct positions: ERM has the worst ranking but cleanest abstention; Reweighted ERM has the best ranking but worsens calibration gaps; Group DRO fixes calibration parity but breaks abstention.
4. Post-Hoc Dependency. We demonstrate that intervention efficacy is determined by the training method. Abstention works under ERM, partially under Reweighted ERM, and fails under Group DRO because DRO's training decouples confidence from correctness.
## 2 Related Work
Bias in toxicity classification. Dixon et al. ([dixon2018measuring](https://arxiv.org/html/2605.14074#bib.bib2)) documented disparate error rates in early classifiers, while Borkan et al. ([borkan2019nuanced](https://arxiv.org/html/2605.14074#bib.bib1)) formalized the BPSN and BNSP AUC metrics and released the Civil Comments dataset. Garg et al. ([garg2019counterfactual](https://arxiv.org/html/2605.14074#bib.bib10)) established that ranking metrics can mask threshold-level deployment failures, motivating our integrated framework pairing ranking with calibration and tail-distribution analysis.
Group robustness. Sagawa et al. ([sagawa2020distributionally](https://arxiv.org/html/2605.14074#bib.bib7)) formulated Group DRO for fair learning, though later work suggests reweighting often matches its performance ([idrissi2022simple](https://arxiv.org/html/2605.14074#bib.bib5)). We replicate this on Civil Comments and show that DRO and reweighting diverge sharply on calibration fairness, an axis prior comparisons did not measure.
Calibration as fairness. Guo et al. ([guo2017calibration](https://arxiv.org/html/2605.14074#bib.bib4)) introduced temperature scaling to address neural network miscalibration. Pleiss et al. ([pleiss2017calibrationfairness](https://arxiv.org/html/2605.14074#bib.bib11)) argue that calibration parity across groups is a vital fairness criterion. Our work uses per-subgroup ECE with bootstrap CIs to connect this parity to downstream abstention efficacy.
Selective prediction. Geifman and El-Yaniv ([geifman2017selective](https://arxiv.org/html/2605.14074#bib.bib3)) formalized confidence-based abstention via risk-coverage curves. We evaluate this connection per subgroup, showing that abstention efficacy varies systematically across identity groups, making the safety mechanism itself a fairness concern.
Hate speech generalization. Mathew et al. ([mathew2021hatexplain](https://arxiv.org/html/2605.14074#bib.bib6)) released HateXplain for explainability. We use it for zero-shot transfer to frame cross-dataset performance as a deployment and generalization concern rather than a direct fairness comparison.
## 3 Data
Civil Comments. We use the Jigsaw dataset ([borkan2019nuanced](https://arxiv.org/html/2605.14074#bib.bib1)) containing 1.8 million comments. Toxicity and identity-mention scores are binarized at 0.5, yielding an 8% positive rate. We stratified-downsample to 200,000 examples, split 80/10/10 into train (160,000), validation (20,000), and test (20,000) using `random_state=42`.
Group Assignment. Examples are assigned to group $g=(\text{identity}, y)$ for the first identity mentioned. The test set includes 18,217 background examples (no identity mentioned) and eight identity groups with sufficient support: white (276), muslim (247), gay/lesbian (129), black (146), jewish (83), christian, female, and male. Groups with $n<50$ (e.g., hindu, atheist) are excluded from reported metrics to ensure stable bootstrap estimates.
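The first-match assignment rule can be sketched as follows. This is a minimal illustration, not the authors' code: the name `IDENTITY_ORDER`, the function `assign_group`, the use of a fixed column order to decide which identity counts as "first", and the `>=` comparison at the 0.5 cutoff are all assumptions.

```python
from typing import Dict, Tuple

# Hypothetical scan order; the paper does not say how "first identity mentioned" is resolved.
IDENTITY_ORDER = ("white", "muslim", "gay/lesbian", "black",
                  "jewish", "christian", "female", "male")


def assign_group(identity_scores: Dict[str, float],
                 toxicity: float,
                 threshold: float = 0.5) -> Tuple[str, int]:
    """Return (identity, y) for the first identity whose score clears the threshold."""
    y = int(toxicity >= threshold)  # binarized toxicity label
    for name in IDENTITY_ORDER:
        if identity_scores.get(name, 0.0) >= threshold:
            return (name, y)
    # No identity mentioned: the example belongs to the background group.
    return ("background", y)
```

Under this rule a comment mentioning several identities contributes to exactly one group, which is the behaviour the Limitations section flags as potentially understating intersectional harms.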
HateXplain. For zero-shot transfer, we use HateXplain ([mathew2021hatexplain](https://arxiv.org/html/2605.14074#bib.bib6)). Majority-vote labels map `hatespeech` to $y=1$ and others to $y=0$. The test split contains 1,924 examples (30.9% toxic). We use this to evaluate generalization across domain gaps between news comments and social media posts.
## 4 Methods
We categorize methods into training-time interventions, which produce distinct models, and post-hoc interventions, which operate on trained outputs. These are coupled: the efficacy of a post-hoc mechanism is determined by the training method.
ERM Baseline. We fine-tune `distilbert-base-uncased` with cross-entropy for 2 epochs, batch size 16, and a linear learning-rate schedule ($5\times 10^{-5}$ to 0). ERM serves as the baseline, minimizing average loss uniformly with no subgroup awareness.
Reweighted ERM. We apply per-example weights $w_i = N/(G \cdot n_{g_i})$ based on group frequency. Weights are clipped at 50.0 to prevent rare groups from dominating gradients. This serves as a middle ground between ERM and adaptive DRO.
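A minimal sketch of the weighting formula, assuming $N$ is the number of training examples, $G$ the number of distinct groups, and $n_{g_i}$ the size of the group containing example $i$ (the text does not define these symbols explicitly). The function name `group_weights` is hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def group_weights(groups: Sequence[Hashable], clip: float = 50.0) -> List[float]:
    """Per-example weights w_i = N / (G * n_{g_i}), clipped from above."""
    counts = Counter(groups)   # n_g for every group
    n_total = len(groups)      # N
    n_groups = len(counts)     # G
    return [min(n_total / (n_groups * counts[g]), clip) for g in groups]
```

With 98 background examples and 2 examples of a rare group, the rare examples receive weight 25 and background examples about 0.51; the clip only becomes active for much rarer groups.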
Group DRO. We implement adaptive per-group weights $q_g$ updated each batch via $q_g \leftarrow q_g \cdot \exp(\eta L_g)$ with $\eta = 0.001$. This minimax objective focuses on the highest-loss group.
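The multiplicative update can be sketched as below. Two details are assumptions rather than statements from the text: the weights are renormalized to sum to one after each update (as in the formulation of Sagawa et al.), and groups absent from a batch are treated as having zero loss. The function names are hypothetical.

```python
import math
from typing import Dict


def update_group_weights(q: Dict[str, float],
                         group_losses: Dict[str, float],
                         eta: float = 0.001) -> Dict[str, float]:
    """Apply q_g <- q_g * exp(eta * L_g), then renormalize (assumed step)."""
    unnormalized = {g: q[g] * math.exp(eta * group_losses.get(g, 0.0)) for g in q}
    total = sum(unnormalized.values())
    return {g: v / total for g, v in unnormalized.items()}


def robust_loss(q: Dict[str, float], group_losses: Dict[str, float]) -> float:
    """Weighted batch loss: sum of q_g * L_g over groups present in the batch."""
    return sum(q[g] * loss for g, loss in group_losses.items())
```

Because the step size multiplies the loss inside the exponential, a small $\eta$ such as 0.001 moves the weights slowly, so groups that start with most of the weight keep it for many batches.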
Temperature Scaling. We learn a scalar $T \in [0.5, 5.0]$ via validation NLL grid search to produce $p_{\text{cal}} = \mathrm{softmax}(\mathbf{z}/T)$. This corrects uniform miscalibration but cannot fix selective, subgroup-specific errors.
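A minimal sketch of the grid search, assuming a grid of 46 evenly spaced temperatures (step 0.1); the paper gives the range but not the resolution. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows: Sequence[Sequence[float]], labels: Sequence[int],
             temperature: float) -> float:
    """Average negative log-likelihood of the true class at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[y]) for z, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, lo=0.5, hi=5.0, steps=46) -> float:
    """Return the grid temperature with the lowest validation NLL."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))
```

A single scalar rescales every logit vector identically, so it can only correct over- or under-confidence that is shared across all inputs.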
Confidence-based Abstention. We compute confidence as $\max(p(x), 1-p(x))$. At coverage $c$, we retain the $c$-fraction of predictions with the highest confidence and compute the error rate on them, which yields risk-coverage curves. This safety mechanism assumes confidence tracks correctness uniformly across subgroups.
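One point on a risk-coverage curve can be computed as in the sketch below. The decision threshold of 0.5, the truncation used to turn a coverage fraction into a count, and the function name `risk_at_coverage` are assumptions.

```python
from typing import Sequence


def risk_at_coverage(probs: Sequence[float], labels: Sequence[int],
                     coverage: float, threshold: float = 0.5) -> float:
    """Error rate on the `coverage` fraction of predictions with highest confidence."""
    scored = []
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)
        is_error = int((p >= threshold) != bool(y))
        scored.append((confidence, is_error))
    # Most confident predictions first; the rest are deferred.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, int(coverage * len(scored)))
    return sum(err for _, err in scored[:keep]) / keep
```

Evaluating this at a range of coverage values, separately for each subgroup, gives per-subgroup curves. The risk falls as coverage shrinks only if the errors sit among the low-confidence predictions.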
Per-identity Threshold Optimization. We grid search $\tau_g \in [0.1, 0.9]$ on validation data to minimize the absolute error gap between subgroups and background. This uniform-shift correction can only repair bias that manifests as a constant probability offset.
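A minimal sketch of the per-group search, assuming the background error rate is measured once at a fixed threshold and passed in, and assuming a grid step of 0.01. Neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import Sequence


def error_rate(probs: Sequence[float], labels: Sequence[int], tau: float) -> float:
    wrong = sum(int((p >= tau) != bool(y)) for p, y in zip(probs, labels))
    return wrong / len(labels)


def fit_group_threshold(group_probs: Sequence[float], group_labels: Sequence[int],
                        background_error: float,
                        lo: float = 0.1, hi: float = 0.9, steps: int = 81) -> float:
    """Pick tau_g minimizing |subgroup error(tau_g) - background error| on validation data."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda tau: abs(
        error_rate(group_probs, group_labels, tau) - background_error))
```

The search moves a single cut point per group, so it helps when benign scores are shifted upward as a block and not when a few benign comments score as high as genuinely toxic ones.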
Calibration Fairness Gap. We compute ECE separately on each subgroup and on the background, using 15 equal-width bins. The calibration-fairness gap is $\Delta\text{ECE}(g) = \text{ECE}(g) - \text{ECE}(\text{background})$. A gap with a CI excluding zero indicates a fairness violation regardless of ranking performance.
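The gap can be sketched as below. The sketch assumes the confidence-based ECE of Guo et al. (bins over $\max(p, 1-p)$, comparing mean confidence with accuracy); the text does not say whether bins are taken over confidence or over the positive-class probability. The function names are hypothetical.

```python
from typing import Sequence


def expected_calibration_error(probs: Sequence[float], labels: Sequence[int],
                               n_bins: int = 15) -> float:
    """ECE over equal-width confidence bins on [0, 1]."""
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)
        correct = int((p >= 0.5) == bool(y))
        b = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        counts[b] += 1
        conf_sums[b] += confidence
        hit_sums[b] += correct
    # sum_b (n_b / n) * |mean_conf_b - acc_b| simplifies to |conf_sum_b - hit_sum_b| / n
    total = sum(abs(conf_sums[b] - hit_sums[b]) for b in range(n_bins) if counts[b])
    return total / len(labels)


def ece_gap(sub_probs, sub_labels, bg_probs, bg_labels, n_bins: int = 15) -> float:
    """Calibration-fairness gap: ECE(subgroup) - ECE(background)."""
    return (expected_calibration_error(sub_probs, sub_labels, n_bins)
            - expected_calibration_error(bg_probs, bg_labels, n_bins))
```

For example, four subgroup predictions at $p=0.9$ of which only two are correct give a subgroup ECE of 0.4, while a perfectly confident and correct background gives 0, so the gap is 0.4.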
Statistical Inference. For all estimates, we run 1000 paired bootstrap iterations. We report means and 95% CIs (2.5/97.5 percentiles). Differences are significant only if the CI excludes zero.
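A minimal sketch of a paired percentile bootstrap for the difference in a metric between two models. It assumes "paired" means both models are scored on the same resampled indices in each iteration, and it uses a simple nearest-rank percentile. The function name and signature are hypothetical.

```python
import random
from typing import Callable, Sequence, Tuple


def paired_bootstrap_ci(metric: Callable[[Sequence[float], Sequence[int]], float],
                        labels: Sequence[int],
                        probs_a: Sequence[float],
                        probs_b: Sequence[float],
                        n_boot: int = 1000,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for metric(B) - metric(A) with a 95% percentile CI."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        y = [labels[i] for i in idx]
        a = metric([probs_a[i] for i in idx], y)
        b = metric([probs_b[i] for i in idx], y)    # same indices: paired comparison
        diffs.append(b - a)
    diffs.sort()
    lower = diffs[int(0.025 * (n_boot - 1))]
    upper = diffs[int(0.975 * (n_boot - 1))]
    return sum(diffs) / n_boot, lower, upper
```

A difference is then called significant when `lower` and `upper` have the same sign, i.e. the interval excludes zero.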
## 5 Evaluation Framework
Fairness is a multi-axis property. Table [1](https://arxiv.org/html/2605.14074#S5.T1) maps these axes to specific metrics and interventions.
Table 1: Integrated evaluation framework. All results include paired bootstrap 95% CIs.

| Axis / Interaction | Metric | Methods | Purpose |
| --- | --- | --- | --- |
| Ranking fairness | Subgroup, BPSN, BNSP AUC | ERM, Reweighted, DRO | Measure toxicity ordering within and across subgroups. |
| Calibration fairness | Subgroup ECE, ECE gap | ERM, Reweighted, DRO | Check if confidence scores are reliable across different groups. |
| Tail behavior | % benign $p>0.9$ | ERM, Reweighted, DRO | Identify "confident-wrong" errors on identity-mentioning content. |
| Threshold parity | Error gap at $\tau=0.5$ | ERM, Reweighted, DRO | Evaluate deployment-level error rates before optimization. |
| Post-hoc coupling | $T^{*}$, $\tau_g^{*}$, risk at $c$ | ERM, Reweighted, DRO × T-scaling, abstention, threshold opt. | Test if post-hoc fixes can repair the specific errors of each trainer. |
| Generalization | AUC, ECE, BPSN | ERM (HateXplain) | Probe cross-dataset transfer as a deployment concern. |
## 6 Results
Results are organized around three fairness axes: ranking, calibration, and abstention. Section [6.1](https://arxiv.org/html/2605.14074#S6.SS1) establishes the ERM baseline. Section [6.2](https://arxiv.org/html/2605.14074#S6.SS2) shows how fairness methods reshape the axes. Section [6.3](https://arxiv.org/html/2605.14074#S6.SS3) shows that post-hoc interventions inherit each training method's failure modes. Section [6.4](https://arxiv.org/html/2605.14074#S6.SS4) synthesizes the trade-off and reports zero-shot transfer to HateXplain. Section [6.5](https://arxiv.org/html/2605.14074#S6.SS5) grounds findings in failure cases.
### 6.1 ERM baseline: hidden calibration disparity
ERM achieves overall AUC 0.940, ECE 0.013, and error rate 5.35%, aggregate metrics that appear strong. Subgroup decomposition reveals two hidden disparities.
Ranking disparity. Table [2](https://arxiv.org/html/2605.14074#S6.T2) shows that white, black, gay/lesbian, and muslim subgroups exhibit BPSN AUC $\leq 0.825$, well below the overall AUC of 0.940. Error gaps reach +0.199 for white, where the subgroup error rate is $\approx 4\times$ the background rate. High BNSP alongside low BPSN is the signature of identity mentions acting as toxicity signals.
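The two ranking metrics can be sketched as follows, using the definitions of Borkan et al.: BPSN AUC is computed on background positives plus subgroup negatives, and BNSP AUC on background negatives plus subgroup positives. Here "background" is taken to mean every example outside the subgroup, which is an assumption about this paper's setup. The pairwise AUC is quadratic in the number of examples and is meant only as an illustration; function names are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bpsn_auc(scores, labels, in_subgroup) -> float:
    """Background Positive, Subgroup Negative: low values mean benign subgroup text outranks toxic background text."""
    keep = [(s, y) for s, y, g in zip(scores, labels, in_subgroup)
            if (g and y == 0) or (not g and y == 1)]
    return auc([s for s, _ in keep], [y for _, y in keep])


def bnsp_auc(scores, labels, in_subgroup) -> float:
    """Background Negative, Subgroup Positive."""
    keep = [(s, y) for s, y, g in zip(scores, labels, in_subgroup)
            if (g and y == 1) or (not g and y == 0)]
    return auc([s for s, _ in keep], [y for _, y in keep])
```

In a toy case where one benign subgroup comment scores 0.8, above background toxic comments at 0.6 and 0.7, BPSN AUC drops while BNSP AUC stays high, which is the pattern described above.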
Table 2: ERM subgroup fairness. "n/a" = fewer than 50 toxic subgroup examples for stable BNSP estimation.
Calibration disparity. Despite overall ECE 0.013, every identity subgroup has significantly higher ECE than background (Table [3](https://arxiv.org/html/2605.14074#S6.T3), all CIs exclude zero). The gap reaches +0.134 on jewish and +0.087 on gay/lesbian. The model is well calibrated on bulk content but systematically overconfident on identity-mentioning content. This is a fairness violation BPSN cannot detect: a prediction of $p=0.85$ on white-mentioning content does not correspond to the same accuracy as $p=0.85$ on background.
Table 3: ERM calibration-fairness gap. Background ECE $=0.0099$ ($n=18{,}217$). All gaps significant (CIs exclude zero).
### 6.2 Training-time interventions
Table [4](https://arxiv.org/html/2605.14074#S6.T4) shows aggregate metrics for all three methods. Both fairness methods produce significant AUC drops. DRO's ECE rises $10\times$; Reweighted ECE rises $3\times$. Aggregate ECE is uninformative without subgroup decomposition: the methods produce categorically different calibration distributions.
Table 4: Aggregate test metrics. AUC drops vs. ERM significant for both methods (Reweighted CI $[-0.019, -0.008]$; DRO CI $[-0.016, -0.007]$).
Ranking axis. Both fairness methods improve BPSN AUC on all eight identities (Table [5](https://arxiv.org/html/2605.14074#S6.T5), all CIs exclude zero), with Reweighted ERM leading on 7 of 8. Both simultaneously reduce BNSP on every measurable identity, confirming a genuine fairness-accuracy trade-off. Subgroup AUC is unchanged across methods, showing the methods shift ranking across groups rather than improving within-group discrimination.
Table 5: Three-way BPSN/BNSP. Bold = best BPSN. All BPSN gains and measurable BNSP losses significant (paired bootstrap CIs exclude zero).
Calibration axis. Figure [1](https://arxiv.org/html/2605.14074#S6.F1) and Table [6](https://arxiv.org/html/2605.14074#S6.T6) show three qualitatively different calibration profiles. ERM has hidden subgroup disparity (background ECE 0.010, but every subgroup significantly miscalibrated). Reweighted ERM amplifies the disparity: background ECE rises modestly to 0.025, while subgroup gaps reach +0.232 on white and +0.230 on black, approximately $3\times$ ERM's gaps. The fairness intervention worsened calibration disparity. Group DRO eliminates subgroup disparity (every gap CI crosses zero) but only by becoming uniformly miscalibrated everywhere (background ECE 0.118). Reliability diagrams (Figure [2](https://arxiv.org/html/2605.14074#S6.F2)) confirm these patterns visually.
Figure 1: Calibration-fairness gap by method. ERM has significant disparity on all eight identities. Reweighted worsens it substantially. DRO eliminates it but at the cost of uniform global miscalibration.
Table 6: Calibration-fairness gap (subgroup ECE minus background ECE). Bold = CI excludes zero. Background ECE: ERM 0.010, Reweighted 0.025, DRO 0.118.
Figure 2: Reliability diagrams by subgroup. Background curves hug the diagonal; identity subgroup curves deviate substantially. Subgroup curves are noisy due to small $n$; Table [6](https://arxiv.org/html/2605.14074#S6.T6) is the reliable summary.
Tail axis. Table [7](https://arxiv.org/html/2605.14074#S6.T7) and Figure [3](https://arxiv.org/html/2605.14074#S6.F3) show that similar BPSN gains arise from categorically different distribution shifts. DRO uniformly right-shifts all benign predictions (mean $p$: $0.214 \to 0.253$ for white; 68–76% of examples move up). Reweighted ERM bimodally sharpens: the mean falls ($0.214 \to 0.129$) as most examples move toward zero, but a small tail is pushed to $p>0.99$. Critically, ERM produces zero benign-at-$p>0.99$ predictions; both fairness methods produce 1.5–4.3%. These confidently-wrong predictions would auto-flag benign identity speech in any pipeline using a high-confidence removal threshold. BPSN AUC cannot detect this failure mode.
Figure 3: Predicted toxicity distributions on benign identity-flagged comments. ERM concentrates near zero; DRO right-shifts; Reweighted ERM sharpens toward zero but develops a $p>0.9$ tail.
Table 7: Tail-distribution probe on benign identity-mentioning comments. Both fairness methods produce 1.5–4.3% benign-at-$p>0.99$; ERM produces none.
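The tail probe described above amounts to counting benign comments scored above a high cutoff. A minimal sketch, with hypothetical function names and the 0.99 cutoff taken from the text:

```python
from typing import Sequence


def benign_tail_rate(probs: Sequence[float], labels: Sequence[int],
                     cutoff: float = 0.99) -> float:
    """Fraction of benign (y = 0) examples with predicted toxicity above `cutoff`."""
    benign = [p for p, y in zip(probs, labels) if y == 0]
    if not benign:
        return float("nan")
    return sum(p > cutoff for p in benign) / len(benign)


def benign_mean_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean predicted toxicity on benign examples, to compare with the tail rate."""
    benign = [p for p, y in zip(probs, labels) if y == 0]
    return sum(benign) / len(benign) if benign else float("nan")
```

Reporting the mean and the tail rate side by side separates the two shifts discussed here: a uniform right-shift raises the mean without necessarily creating a tail, while bimodal sharpening can lower the mean and still push a few benign comments past the cutoff.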
### 6.3 Post-hoc interventions: inheriting training failure modes
Temperature scaling. Table [8](https://arxiv.org/html/2605.14074#S6.T8) shows $T^{*}=1.0$ for all three models. This is not a uniform null result but three distinct failures. ERM needs no fix (already calibrated globally). Reweighted ERM has selective miscalibration concentrated in identity subgroups; a scalar cannot repair subgroup-specific errors. DRO has structural decoupling of confidence from correctness; a scalar cannot restore a broken confidence-correctness relationship. Temperature scaling is a uniform-shift correction, and none of the training methods produce uniform-shift miscalibration.
Table 8: Temperature scaling results. $T^{*}=1.0$ for all models; ECE is unchanged. The failure reason differs by training method.
Per-identity threshold optimization. Table [9](https://arxiv.org/html/2605.14074#S6.T9) shows error gaps across methods. On ERM, threshold optimization reduces the white gap by only 0.002 and worsens muslim by +0.019. No method achieves threshold-level parity. The reason mirrors temperature scaling: threshold optimization corrects uniform-shift bias, but the bias here is tail-heavy. Most benign comments are correctly scored near zero; a small tail is pushed to extreme confidence. No single threshold separates confidently-wrong benign comments from genuinely toxic ones in the same score range.
Table 9: Error gap (subgroup minus background) at $\tau=0.5$ across methods, and per-identity threshold optimization on ERM.
Abstention. Figure [4](https://arxiv.org/html/2605.14074#S6.F4) shows per-subgroup risk-coverage curves. ERM's overall risk drops from 5.35% to 0.75% at coverage 0.7, which is textbook selective prediction behavior. Reweighted ERM drops smoothly but less steeply. Group DRO's curve rises with deferral because its ECE of 0.134 structurally decouples confidence from correctness: deferring low-confidence predictions does not preferentially defer wrong ones. *Group DRO and abstention do not compose.*
Abstention is also unfair across subgroups (Figure [4](https://arxiv.org/html/2605.14074#S6.F4)). On background, ERM drives risk near zero by coverage 0.5. On the white subgroup, residual risk persists at $\geq 14\%$ at every coverage level for all three methods. Muslim and gay/lesbian fall between these extremes. Abstention provides substantially less safety on identity-mentioning content because the errors are confidently wrong rather than uncertain, so they survive the confidence filter. Table [10](https://arxiv.org/html/2605.14074#S6.T10) shows that muslim, female, and male gaps close under ERM abstention; the white gap persists. The training-method choice determines which subgroup receives safe deferral.
Figure 4: Per-subgroup risk-coverage. ERM and Reweighted ERM show clean abstention on background. White subgroup residual risk stays $\geq 14\%$ across all methods and coverage levels. DRO shows minimal abstention benefit on any subgroup.
Table 10: ERM abstention error gaps. Muslim, female, and male gaps shrink toward zero by 80% coverage. The white gap persists at $\geq 0.11$, the signature of confidently-wrong predictions surviving the confidence filter.
### 6.4 Integrated picture: three axes, no dominant method
Table [11](https://arxiv.org/html/2605.14074#S6.T11) summarizes each method's position across the three axes and post-hoc compatibility. No method is Pareto-dominant. ERM has the worst ranking fairness but cleanest abstention. Reweighted ERM has the best ranking fairness but worst calibration disparity. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated, breaking abstention entirely. The choice of training method is a choice of which fairness axis to prioritize and which post-hoc mechanism to retain.
Finally, zero-shot transfer to HateXplain collapses AUC from 0.940 to 0.564, with BPSN values near 0.5. We treat this as a deployment-generalization finding, since near-random performance on the target dataset makes fairness comparisons uninterpretable.
Table 11: Three-axis summary. No method dominates across all columns.
### 6.5 Qualitative failure cases
Table [12](https://arxiv.org/html/2605.14074#S6.T12) shows benign white-flagged comments where ERM already predicts $p>0.6$. All five span all three axes simultaneously: ranking failure (scored above toxic background), calibration failure (high stated confidence, zero actual correctness), and tail failure (clustering at $p\geq 0.9$). Row 14132 is the only case where DRO correctly reduces the prediction (0.65 to 0.14) while Reweighted worsens it (0.90), showing the methods do not fail uniformly. These cases explain why ERM's white abstention gap persists: the errors are high-confidence, so the confidence filter cannot remove them.
Table 12: Benign ($y=0$) white-flagged failure cases. Bold = $p>0.99$.
### 6.6 Limitations
Sparse minority groups. Rare identity groups have $n<250$ in our test set (jewish: 83, gay/lesbian: 129, black: 146), producing wide CIs. Stratified identity oversampling would address this and likely stabilize DRO weight dynamics; our fix (reducing $\eta$ from 0.01 to 0.001) left background at 83% of group weight.
First-match group assignment. Multi-identity comments contribute to only one group's training signal and evaluation, potentially understating intersectional harms.
Single model family. All experiments use DistilBERT-base. The mechanisms we document likely reflect training-objective properties rather than architecture, but replication on larger models is needed.
ECE binning sensitivity. We use 15 equal-width bins throughout. Alternative binning would shift absolute ECE values but not the qualitative ordering of methods or the significance of calibration-fairness gaps.
## 7 Conclusion
Fairness in toxicity classification involves three integrated axes (ranking, calibration, and abstention), and training-time interventions determine what post-hoc mechanisms can repair. We showed this through a bootstrap-validated comparison of ERM, Reweighted ERM, and Group DRO on Civil Comments, jointly evaluated with temperature scaling, abstention, and threshold optimization.
ERM is significantly miscalibrated on every identity subgroup despite near-perfect aggregate calibration, a fairness violation BPSN cannot detect. Reweighted ERM improves ranking fairness but worsens calibration disparity. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated, breaking abstention. Each post-hoc intervention fails in ways determined by the training method: temperature scaling finds $T^{*}=1.0$ for three distinct reasons, threshold optimization cannot repair tail-heavy bias, and abstention breaks under DRO while remaining unfair on white-flagged content under all methods.
SRAI fairness evaluation must combine ranking metrics, per-subgroup calibration gaps with bootstrap CIs, per-subgroup abstention behavior, and qualitative failure-mode analysis. Methods equivalent on aggregate ranking differ sharply in failure modes. On this benchmark, Reweighted ERM is the more defensible of the two fairness methods: relative to Group DRO it has better aggregate calibration, remains compatible with abstention, and achieves equivalent ranking. However, both fairness methods introduce confidently-wrong predictions on benign identity content that ERM does not, and no post-hoc intervention repairs this.
## References
- [1] Borkan, D., Dixon, L., Sorensen, J., Thain, N., and Vasserman, L. (2019). Nuanced metrics for measuring unintended bias with real data for text classification. WWW Companion.
- [2] Dixon, L., Li, J., Sorensen, J., Thain, N., and Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. AAAI/ACM AIES.
- [3] Geifman, Y., and El-Yaniv, R. (2017). Selective classification for deep neural networks. NeurIPS 2017.
- [4] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. ICML 2017.
- [5] Idrissi, B. Y., Arjovsky, M., Pezeshki, M., and Lopez-Paz, D. (2022). Simple data balancing achieves competitive worst-group-accuracy. CLeaR 2022.
- [6] Mathew, B., Saha, P., Yimam, S. M., Biemann, C., Goyal, P., and Mukherjee, A. (2021). HateXplain: A benchmark dataset for explainable hate speech detection. AAAI 2021.
- [7] Sagawa, S., Koh, P. W., Hashimoto, T. B., and Liang, P. (2020). Distributionally robust neural networks for group shifts. ICLR 2020.
- [8] Sagawa, S., Raghunathan, A., Koh, P. W., and Liang, P. (2020). An investigation of why overparameterization exacerbates spurious correlations. ICML 2020.
- [9] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. (2019). DistilBERT, a distilled version of BERT. NeurIPS EMC2 Workshop.
- [10] Garg, S., Perot, V., Limtiaco, N., Taly, A., Chi, E. H., and Beutel, A. (2019). Counterfactual fairness in text classification through robustness. AAAI/ACM AIES.
- [11] Pleiss, G., Raghavan, M., Wu, F., Kleinberg, J., and Weinberger, K. Q. (2017). On fairness and calibration. NeurIPS 2017.