DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation
Summary
Introduces DriftGuard, a safety-aware adaptive moderation framework that uses multiple monitors to detect subtle, safety-relevant distribution shifts and selectively updates models with a hard-mix adaptation set, improving toxic recall on evolving datasets.
# DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation
Source: [https://arxiv.org/html/2606.28725](https://arxiv.org/html/2606.28725)
Yuting Xin†\*, Department of Information and Decision Sciences, University of Minnesota, Minneapolis, USA, yuting.xin@outlook.com

Hanyu Cai†, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, USA, hanyucai2022@u.northwestern.edu

Binqi Shen, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, USA, binqishen2021@u.northwestern.edu

Lier Jin, Fuqua School of Business, Duke University, Durham, USA, lierjin@alumni.duke.edu

Lan Hu, Department of Engineering, Carnegie Mellon University, Pittsburgh, USA, lanh@alumni.cmu.edu
###### Abstract
Automated toxicity moderation systems operate in dynamic online environments where harmful behavior evolves through coded language, shifting targets, and strategic adaptation to enforcement. Existing drift detection methods often focus on global distributional change, but such signals may miss safety-relevant shifts that emerge in localized harm subspaces or high-risk model-error regions. This paper introduces DriftGuard, a safety-aware adaptive moderation framework that combines multi-monitor drift detection with selective model updating. The framework tracks global text drift, identity-harm drift, model uncertainty, toxic-risk drift, and false-negative-risk drift. When safety-relevant change is detected, the model is updated using a hard-mix adaptation set that prioritizes likely false negatives, identity-related high-risk examples, false-positive-risk examples, and uncertain boundary cases. Experiments on Civil Comments temporal shift and Jigsaw-to-DynaHate cross-dataset shift show that safety-aware monitors detect risks missed by global drift alone. Hard-mix adaptation improves toxic recall and accuracy over no-update and random-balanced baselines, raising toxic recall to 0.8777 on Civil Comments and from 0.7107 to 0.8523 on DynaHate. Bootstrap analysis further shows stable DynaHate safety gains, with toxic recall increasing by 0.1418 and false-negative prevalence decreasing by 0.0781. Overall, DriftGuard links safety-aware drift detection to targeted, lightweight model updating for more robust adaptive toxicity moderation.
## I Introduction
Automated toxicity moderation systems are increasingly deployed to identify harmful user\-generated content at large scale\. Although modern classifiers can perform well under their original training distributions, moderation environments are non\-stationary: users alter phrasing, adopt coded language, shift targets, and respond strategically to platform enforcement, including adversarial prompt formulations that disguise harmful intent\[[10](https://arxiv.org/html/2606.28725#bib.bib1),[39](https://arxiv.org/html/2606.28725#bib.bib35),[18](https://arxiv.org/html/2606.28725#bib.bib38)\]\. As a result, a model that initially detects toxic or hateful content reliably may become less effective as harmful behavior evolves\. This creates a practical need for moderation systems that can detect safety\-relevant change and adapt efficiently without relying on frequent full retraining\[[26](https://arxiv.org/html/2606.28725#bib.bib2)\]\.
A central challenge is that distribution shift in moderation is not always visible at the aggregate level\[[4](https://arxiv.org/html/2606.28725#bib.bib3),[29](https://arxiv.org/html/2606.28725#bib.bib8)\]. Recent work on LLM character understanding shows that apparent benchmark success can reflect memorization rather than genuine reasoning, which reinforces our motivation to monitor safety-relevant behavior rather than rely only on aggregate performance signals\[[14](https://arxiv.org/html/2606.28725#bib.bib26)\]. Many drift detection methods monitor broad changes in the input or prediction distribution, but harmful behavior can emerge in localized safety-critical regions that surface-level aggregate metrics do not capture\[[38](https://arxiv.org/html/2606.28725#bib.bib36)\]. In these cases, global text drift may remain modest even while the model becomes more likely to miss harmful content\[[24](https://arxiv.org/html/2606.28725#bib.bib5)\]. Adaptive moderation therefore requires monitoring signals that are aligned not only with distributional change, but also with moderation risk\[[27](https://arxiv.org/html/2606.28725#bib.bib4),[34](https://arxiv.org/html/2606.28725#bib.bib37),[25](https://arxiv.org/html/2606.28725#bib.bib25)\].
This paper proposes a safety\-aware adaptive moderation framework that links multi\-monitor drift detection with selective model updating\. The framework monitors incoming data using global distribution drift, harm\-subspace drift, model uncertainty, toxic\-risk drift, and false\-negative\-risk drift\. When one or more monitors indicate safety\-relevant change, the model is updated using a hard\-mix adaptation set composed of high\-risk and informative examples, rather than a random sample of recent data\. This design allows adaptation to focus on the types of examples most likely to affect moderation safety while keeping the update targeted and lightweight\[[30](https://arxiv.org/html/2606.28725#bib.bib7)\]\.
The paper makes three contributions\. First, it formulates moderation drift as a safety\-aware monitoring problem in which global distribution shift is supplemented by harm\-subspace and model\-risk signals\. Second, it introduces a multi\-monitor trigger mechanism paired with hard\-mix selective adaptation, connecting drift detection directly to model updating\. Third, it evaluates the framework under temporal and cross\-domain moderation shift, showing that safety\-aware monitors can detect targeted harm shifts missed by global drift alone and that hard\-mix adaptation improves toxic recall and overall robustness compared with no\-update and random\-balanced updating baselines\.
## II Related Work
Prior work has studied drift detection, toxicity moderation, and efficient adaptation largely as separate problems\. However, adaptive moderation requires these components to be connected: the system must detect safety\-relevant drift, determine whether the drift affects harmful or high\-risk subspaces, and update the model using informative examples rather than random recent data\. Our work addresses this gap through a multi\-monitor harm\-aware trigger and a hard\-mix adaptation strategy\.
### II-A Distribution Shift and Drift Monitoring
Distribution shift is a central challenge for deployed machine learning systems\. A model is typically trained under one data\-generating distribution but deployed under conditions that may change over time, producing covariate shift, label shift, or concept drift\. Prior work has formalized these shift types and developed methods for detecting and adapting to non\-stationary data streams\. Surveys on concept drift emphasize that deployed models require not only drift detection, but also drift understanding and adaptation, because unaddressed drift can lead to model degradation over time\[[3](https://arxiv.org/html/2606.28725#bib.bib6)\]\. More recent deployment\-focused work similarly argues that model monitoring should track features, predictions, performance\-related signals, and explanation stability rather than assuming that a static validation set remains representative after deployment\[[24](https://arxiv.org/html/2606.28725#bib.bib5),[19](https://arxiv.org/html/2606.28725#bib.bib42)\]\.
A large body of work detects drift by comparing source and target distributions using statistical tests, divergence measures, or learned representations\. Rabanser et al\. show that dataset shift can often be detected through two\-sample testing after dimensionality reduction, and that domain\-discriminating models can help characterize whether observed shift is harmful\[[26](https://arxiv.org/html/2606.28725#bib.bib2)\]\. Other work focuses on specific forms of shift, such as label shift, where the class prior changes while the class\-conditional distribution remains stable\. For example, Black Box Shift Estimation uses a trained classifier’s predictions to estimate and correct label shift without requiring target labels\[[21](https://arxiv.org/html/2606.28725#bib.bib9)\]\. Benchmarks such as WILDS further show that real\-world distribution shifts can substantially reduce out\-of\-distribution performance even when models perform well in\-distribution\[[16](https://arxiv.org/html/2606.28725#bib.bib10)\]\.
However, existing drift monitoring methods often treat drift as a global property of the input or prediction distribution\. This is limiting for safety\-critical moderation\. First, a statistically detectable shift is not always equivalent to safety\-relevant model degradation; production\-monitoring studies have found that feature or prediction drift can occur without corresponding performance loss, while recent work on monitoring foundation models similarly shows that input shift and performance degradation do not always align directly\[[9](https://arxiv.org/html/2606.28725#bib.bib11)\]\. Recent RAG reliability studies provide important evidence that surface\-level relevance is not sufficient for reliable model behavior: retrieved context can shape outputs under knowledge conflict, and topically relevant citations may still fail to warrant the generated claim\[[5](https://arxiv.org/html/2606.28725#bib.bib24),[25](https://arxiv.org/html/2606.28725#bib.bib25)\]\. Second, global drift metrics may obscure localized changes in subpopulations or high\-risk regions of the data\. In moderation, a small increase in identity\-targeted abuse or false\-negative\-risk examples may be operationally important even if the aggregate input distribution changes only modestly\. Related work on discrepancy\-aware fusion and structured semantic signals also highlights the value of integrating complementary signals under noisy or domain\-specific conditions\[[7](https://arxiv.org/html/2606.28725#bib.bib28),[41](https://arxiv.org/html/2606.28725#bib.bib41)\]\. Our work builds on drift monitoring research but shifts the focus from global distribution change alone to safety\-aware monitoring, combining global drift with harm\-subspace and model\-risk signals\.
### II-B Toxicity Moderation and Harm Subspaces
Automated toxicity and hate speech detection is challenging because harmful content varies by target group, linguistic form, and social context\. Prior work shows that aggregate classification metrics can hide important subgroup failures\. Systematic reviews of hate speech detection identify persistent issues including ambiguous task definitions, class imbalance, contextual dependence, and limited cross\-domain generalization\[[13](https://arxiv.org/html/2606.28725#bib.bib12),[23](https://arxiv.org/html/2606.28725#bib.bib13)\]\.
A key concern is unintended bias in moderation models\. Borkan et al\. introduce Civil Comments identity annotations and metrics for measuring subgroup bias in toxicity classifiers\[[4](https://arxiv.org/html/2606.28725#bib.bib3)\]\. Related work shows that abusive language classifiers may over\-predict toxicity for dialectal or identity\-associated language, producing disproportionate errors for marginalized groups\[[29](https://arxiv.org/html/2606.28725#bib.bib8),[8](https://arxiv.org/html/2606.28725#bib.bib14)\]\. These findings motivate evaluation and monitoring at the harm\-subspace level rather than only at the aggregate dataset level\. Related work on fairness in synthetic medical data similarly shows that subgroup representation imbalances can persist in generated datasets, reinforcing the need to monitor protected or high\-risk subspaces rather than relying only on aggregate data quality\[[28](https://arxiv.org/html/2606.28725#bib.bib29)\]\.
Recent benchmarks further emphasize targeted diagnostic evaluation\. HateXplain provides target\-community and rationale annotations\[[22](https://arxiv.org/html/2606.28725#bib.bib15)\], HateCheck introduces functional tests for hate speech models\[[27](https://arxiv.org/html/2606.28725#bib.bib4)\], and DynaHate uses human\-and\-model\-in\-the\-loop data generation to expose challenging hate speech examples\[[32](https://arxiv.org/html/2606.28725#bib.bib16)\]\. These studies show that moderation failures often appear in specific behavioral or identity\-related regions of the data\.
Our work builds on this literature by moving harm\-subspace signals from post\-hoc diagnostic evaluation to drift\-triggered adaptation\. Instead of monitoring only global text drift, our framework tracks safety\-relevant subspaces, such as identity\-harm and false\-negative\-risk regions, and uses these signals to decide when model updating is needed\.
### II-C Selective Adaptation and Efficient Model Updating
A separate line of work studies how models can be updated efficiently when new data becomes available\. Active learning selects informative examples for annotation or training, often using uncertainty, diversity, or expected model improvement as selection criteria\[[40](https://arxiv.org/html/2606.28725#bib.bib17),[17](https://arxiv.org/html/2606.28725#bib.bib18)\]\. Related hard\-example mining methods prioritize difficult or misclassified examples rather than treating all training samples equally\. Online hard example mining and focal loss both show that emphasizing hard or high\-loss examples can improve learning efficiency under class imbalance or many easy examples\[[31](https://arxiv.org/html/2606.28725#bib.bib19),[20](https://arxiv.org/html/2606.28725#bib.bib20)\]\. These ideas motivate selective updating strategies that focus adaptation on examples most likely to affect model behavior\[[37](https://arxiv.org/html/2606.28725#bib.bib33)\]\. Recent work on on\-policy distillation introduces token\-importance selection, showing that high\-entropy and high\-divergence tokens can provide especially useful learning signals and can be prioritized to reduce training cost while preserving performance\[[36](https://arxiv.org/html/2606.28725#bib.bib32)\]\.
For large neural models, full fine\-tuning can be computationally expensive and impractical for frequent updates\. Parameter\-efficient fine\-tuning methods address this by updating only a small subset of parameters or adding lightweight trainable modules\. LoRA freezes the base model and learns low\-rank update matrices, substantially reducing trainable parameters and memory cost while preserving downstream performance\[[12](https://arxiv.org/html/2606.28725#bib.bib21)\]\. Recent PEFT surveys further show that parameter\-efficient adaptation has become a practical approach for customizing large models under limited compute and deployment constraints\[[11](https://arxiv.org/html/2606.28725#bib.bib22),[33](https://arxiv.org/html/2606.28725#bib.bib23),[2](https://arxiv.org/html/2606.28725#bib.bib31),[42](https://arxiv.org/html/2606.28725#bib.bib40)\]\. Related work on structured medical report normalization likewise highlights the value of transforming noisy textual inputs into more consistent supervision for robust model training\[[6](https://arxiv.org/html/2606.28725#bib.bib39)\]\. This emphasis on deployment efficiency is consistent with recent work on large reasoning models, where pruning and distillation have been used to reduce reasoning cost while preserving task performance\[[15](https://arxiv.org/html/2606.28725#bib.bib27)\]\. Related work on adaptive distributed learning similarly highlights the value of monitoring deployment conditions and adjusting compression strategies to balance efficiency and model performance\[[35](https://arxiv.org/html/2606.28725#bib.bib34)\]\.
Our work connects selective sample choice with parameter\-efficient updating for adaptive moderation\. Rather than updating on random recent data, the proposed hard\-mix strategy selects high\-risk and high\-uncertainty examples associated with moderation failures, including likely false negatives, identity\-related high\-risk examples, false\-positive\-risk examples, and uncertain boundary cases\. Related work on prompt indicators further suggests that even lightweight input\-design choices can affect LLM behavior, reinforcing the need for careful evaluation when deploying adaptive language systems\[[1](https://arxiv.org/html/2606.28725#bib.bib30)\]\. This differs from general active learning or PEFT work by using safety\-aware drift monitors to decide both when adaptation should be triggered and which examples should drive the update\.
## III Methodology
### III-A Framework Overview
This paper studies adaptive moderation under safety\-relevant distribution shift\. The proposed framework makes two linked decisions\. First, it monitors incoming data using multiple drift and model\-risk signals\. Second, when one or more monitors indicate safety\-relevant shift, it updates the model using a selectively constructed hard\-mix adaptation set rather than a random sample of recent data or a full retraining corpus\.
The framework is motivated by the fact that harmful behavior can change in localized subspaces\. Identity\-targeted abuse, coded harmful language, uncertain boundary cases, or likely false negatives may increase even when the aggregate text distribution changes only modestly\. Therefore, the framework combines broad distribution monitoring with harm\-subspace and model\-risk monitoring, then uses the detected failure modes to guide model updating\.
### III-B Task Definition
We formulate moderation as binary toxic-content classification. Given a data sample $x$, the model predicts

$$f(x) \rightarrow \{\text{non-toxic}, \text{toxic}\}.$$

The classifier also outputs a toxic-class probability $p_{\mathrm{toxic}}(x)$. This probability is used for classification, threshold calibration, uncertainty monitoring, toxic-risk monitoring, and adaptation-sample ranking. The primary safety objective is to preserve recall on toxic content under shift, because false negatives correspond to harmful comments that remain undetected.
### III-C Multi-Monitor Drift Detection
At each monitoring interval, the framework compares a target batch $B_t$ with a reference batch $B_{\mathrm{ref}}$, where $B_{\mathrm{ref}}$ represents the model's baseline operating distribution. The trigger uses five monitor families: global distribution drift, identity-harm drift, model uncertainty drift, toxic-risk drift, and false-negative-risk drift.
#### III-C1 Global Distribution Drift
Global drift measures broad text-distribution shift. Comments are represented with hashed character $n$-gram features, and Jensen-Shannon distance is computed between the reference and target feature distributions:

$$D_{\mathrm{global}} = JS\!\left(P_{\mathrm{ngram}}(B_{\mathrm{ref}}),\, P_{\mathrm{ngram}}(B_t)\right).$$

This monitor captures major changes in language distribution, but it does not determine whether the shift is safety\-relevant\.
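The global monitor can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the authors' implementation: the n-gram length (`n=3`), the number of hash buckets (`2**12`), the CRC32 hash, and the base-2 logarithm are all assumptions, since the paper does not state them here.

```python
import math
import zlib
from collections import Counter

def hashed_ngram_distribution(texts, n=3, num_buckets=2**12):
    """Normalized histogram of hashed character n-grams over a batch.
    n and num_buckets are assumed values, not taken from the paper."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            bucket = zlib.crc32(text[i:i + n].encode("utf-8")) % num_buckets
            counts[bucket] += 1
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()} if total else {}

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); terms with zero mass in a contribute nothing.
        return sum(pa * math.log2(pa / m[k]) for k, pa in a.items() if pa > 0)

    divergence = 0.5 * kl(p) + 0.5 * kl(q)
    return math.sqrt(max(divergence, 0.0))

def global_drift(reference_texts, target_texts, n=3, num_buckets=2**12):
    """D_global between a reference batch and a target batch of comments."""
    return js_distance(
        hashed_ngram_distribution(reference_texts, n, num_buckets),
        hashed_ngram_distribution(target_texts, n, num_buckets),
    )
```

With base-2 logarithms the distance lies in $[0, 1]$: it is 0 for identical batches and 1 when the two histograms share no buckets.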
#### III-C2 Identity-Harm Drift
Identity-harm drift measures changes in identity-targeted harmful content when identity annotations are available. For example, this monitor uses the `identity_attack` annotation in the Civil Comments dataset. We compute both the change in identity-attack rate and the change in mean identity-attack score:

$$D_{\mathrm{identity\_rate}} = \mathrm{rate}_{\mathrm{identity}}(B_t) - \mathrm{rate}_{\mathrm{identity}}(B_{\mathrm{ref}}),$$

$$D_{\mathrm{identity\_score}} = \mathrm{mean}_{\mathrm{identity}}(B_t) - \mathrm{mean}_{\mathrm{identity}}(B_{\mathrm{ref}}).$$

This monitor captures localized harm shifts that may be diluted in aggregate text statistics\.
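As a rough illustration, both identity-harm quantities can be computed from per-comment `identity_attack` scores. The cutoff that turns a continuous score into an "identity attack" flag (`attack_cutoff=0.5`) is an assumption made for this sketch; the paper does not specify how the rate is binarized.

```python
def identity_harm_drift(ref_scores, target_scores, attack_cutoff=0.5):
    """Return D_identity_rate and D_identity_score for two batches.

    ref_scores / target_scores: per-comment identity_attack scores in [0, 1].
    attack_cutoff: assumed binarization threshold (hypothetical value).
    """
    def rate(scores):
        return sum(s >= attack_cutoff for s in scores) / len(scores)

    def mean(scores):
        return sum(scores) / len(scores)

    return {
        "identity_rate": rate(target_scores) - rate(ref_scores),
        "identity_score": mean(target_scores) - mean(ref_scores),
    }

drift = identity_harm_drift(
    ref_scores=[0.0, 0.0, 0.6, 0.1],
    target_scores=[0.7, 0.8, 0.0, 0.1],
)
```

In this toy example the identity-attack rate rises from 0.25 to 0.50 and the mean score from 0.175 to 0.40.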
#### III-C3 Model Uncertainty Drift
Uncertainty drift measures whether the model becomes less confident on the target batch. For each data sample, predictive entropy is computed from the toxic probability:

$$H(x) = -p_{\mathrm{toxic}}(x)\log_2 p_{\mathrm{toxic}}(x) - \left(1 - p_{\mathrm{toxic}}(x)\right)\log_2\left(1 - p_{\mathrm{toxic}}(x)\right).$$

The uncertainty monitor is the change in mean entropy:

$$D_{\mathrm{entropy}} = \frac{1}{|B_t|}\sum_{x \in B_t} H(x) - \frac{1}{|B_{\mathrm{ref}}|}\sum_{x \in B_{\mathrm{ref}}} H(x).$$

An increase indicates that the target batch contains more examples close to the model's decision boundary.
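A minimal sketch of the entropy monitor follows. The function names are illustrative, and the convention that entropy is 0 at probabilities of exactly 0 or 1 is the standard limit rather than something stated in the paper.

```python
import math

def binary_entropy(p):
    """Predictive entropy H(x) in bits for a toxic probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # limit of p*log2(p) as p -> 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_drift(ref_probs, target_probs):
    """D_entropy: mean entropy on the target batch minus the reference batch."""
    def mean_entropy(probs):
        return sum(binary_entropy(p) for p in probs) / len(probs)
    return mean_entropy(target_probs) - mean_entropy(ref_probs)
```

Entropy peaks at 1 bit when $p_{\mathrm{toxic}}(x) = 0.5$, so a batch of confident predictions followed by a batch of borderline ones yields a positive drift value.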
#### III-C4 Toxic-Risk Drift
Toxic-risk drift measures whether the model assigns greater toxic risk to the target batch. We use two quantities:

$$D_{\mathrm{toxic\_prob}} = \frac{1}{|B_t|}\sum_{x \in B_t} p_{\mathrm{toxic}}(x) - \frac{1}{|B_{\mathrm{ref}}|}\sum_{x \in B_{\mathrm{ref}}} p_{\mathrm{toxic}}(x),$$

$$D_{\mathrm{toxic\_prev}} = \mathrm{PredToxicRate}(B_t) - \mathrm{PredToxicRate}(B_{\mathrm{ref}}).$$

The first monitor captures changes in mean toxic probability, while the second captures changes in predicted toxic prevalence after thresholding\.
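The two toxic-risk quantities differ only in whether the probability is thresholded before averaging, as the sketch below shows. The default `threshold=0.5` and the use of `>=` for the decision rule are assumptions for illustration; the paper calibrates the decision threshold separately.

```python
def toxic_risk_drift(ref_probs, target_probs, threshold=0.5):
    """Return D_toxic_prob and D_toxic_prev for two batches of toxic probabilities."""
    def mean(values):
        return sum(values) / len(values)

    def predicted_toxic_rate(probs):
        # Share of the batch predicted toxic after thresholding (assumed >= rule).
        return sum(p >= threshold for p in probs) / len(probs)

    return {
        "toxic_prob": mean(target_probs) - mean(ref_probs),
        "toxic_prev": predicted_toxic_rate(target_probs) - predicted_toxic_rate(ref_probs),
    }

risk = toxic_risk_drift(
    ref_probs=[0.1, 0.2, 0.3, 0.4],
    target_probs=[0.2, 0.6, 0.7, 0.9],
)
```

Here the mean toxic probability rises by 0.35 while predicted prevalence rises by 0.75, which illustrates why the two monitors are tracked separately: prevalence reacts only to probability mass that crosses the threshold.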
#### III-C5 False-Negative-Risk Drift
False-negative-risk drift measures whether the model becomes more likely to miss toxic content. In the offline experiments, ground-truth toxicity labels are available, allowing us to directly identify toxic examples that the model predicts as non-toxic. We compute the aggregate false-negative prevalence of a batch as

$$\mathrm{FNPrev}(B) = \frac{\sum_{x \in B} \mathbb{1}\left[y(x) = \text{toxic} \land \hat{y}(x) = \text{non-toxic}\right]}{|B|}.$$

The false-negative-risk drift monitor is then

$$D_{\mathrm{fn\_risk}} = \mathrm{FNPrev}(B_t) - \mathrm{FNPrev}(B_{\mathrm{ref}}).$$

This quantity differs from label-conditional false-negative prevalence, which is $1 - \text{toxic recall}$. We report toxic recall separately, and use aggregate false-negative prevalence to measure the share of all evaluated comments that are toxic but missed. In live deployment, immediate labels may be unavailable; in that case, this monitor can be approximated using delayed review outcomes, weak safety labels, model disagreement, high-risk lexical indicators, or policy-specific risk signals.
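The distinction between the two false-negative quantities is easy to see on a small labeled batch. The sketch below assumes labels and predictions are encoded as 1 (toxic) and 0 (non-toxic); the function names are illustrative.

```python
def false_negative_prevalence(labels, preds):
    """FNPrev(B): share of ALL examples that are toxic but predicted non-toxic."""
    misses = sum(1 for y, y_hat in zip(labels, preds) if y == 1 and y_hat == 0)
    return misses / len(labels)

def toxic_recall(labels, preds):
    """Label-conditional recall: share of TOXIC examples predicted toxic."""
    toxic_preds = [y_hat for y, y_hat in zip(labels, preds) if y == 1]
    return sum(toxic_preds) / len(toxic_preds) if toxic_preds else 0.0

def fn_risk_drift(ref, target):
    """D_fn_risk from (labels, preds) pairs for the reference and target batches."""
    return false_negative_prevalence(*target) - false_negative_prevalence(*ref)

labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
fn_prev = false_negative_prevalence(labels, preds)   # 1 miss out of 10 comments
miss_rate = 1 - toxic_recall(labels, preds)          # 1 miss out of 4 toxic comments
```

The two are linked by the toxic base rate of the batch: $\mathrm{FNPrev}(B) = (1 - \text{toxic recall}) \times \text{toxic share of } B$. In the example, $0.25 \times 0.4 = 0.1$, so aggregate prevalence can move because recall changes or simply because the share of toxic content changes.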
### III-D Trigger Rule
The framework triggers adaptation when any monitor exceeds its operating threshold:

$$
\begin{aligned}
\mathrm{trigger}(B_t) = {} & [D_{\mathrm{global}} > \theta_{\mathrm{global}}] \\
& \lor [D_{\mathrm{identity\_rate}} > \theta_{\mathrm{identity\_rate}}] \\
& \lor [D_{\mathrm{identity\_score}} > \theta_{\mathrm{identity\_score}}] \\
& \lor [D_{\mathrm{entropy}} > \theta_{\mathrm{entropy}}] \\
& \lor [D_{\mathrm{toxic\_prob}} > \theta_{\mathrm{toxic\_prob}}] \\
& \lor [D_{\mathrm{toxic\_prev}} > \theta_{\mathrm{toxic\_prev}}] \\
& \lor [D_{\mathrm{fn\_risk}} > \theta_{\mathrm{fn\_risk}}].
\end{aligned}
$$

The rule is intentionally disjunctive because the monitors represent different failure modes\. The framework therefore reports individual monitor values and trigger decisions rather than collapsing them into a single aggregate score\.
The thresholds are monitor\-specific operating points\. The raw monitor values are not directly comparable because Jensen\-Shannon distance measures feature\-distribution divergence, while the other monitors measure changes in harm annotations or model behavior\. The thresholds should therefore be interpreted as risk\-tolerance settings for each monitor rather than universal drift magnitudes\.
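The disjunctive rule can be sketched as a function that returns both the overall decision and the per-monitor outcomes. The thresholds in the example are the operating points given in Section IV-D, except `toxic_prev`, whose threshold is not stated there and is set to 0.03 purely as an assumption. The four nonzero monitor values are those reported for Civil Comments in Section V-A; the remaining three are described only as nearly unchanged, so 0.0 is used as a placeholder.

```python
def evaluate_trigger(monitors, thresholds):
    """Disjunctive trigger: fire if any active monitor exceeds its threshold.

    Monitors that are missing or None (for example, identity monitors when
    identity annotations are unavailable) are treated as inactive.
    """
    fired = {}
    for name, theta in thresholds.items():
        value = monitors.get(name)
        if value is None:
            continue
        fired[name] = value > theta
    return any(fired.values()), fired

thresholds = {
    "global": 0.30, "identity_rate": 0.005, "identity_score": 0.01,
    "entropy": 0.03, "toxic_prob": 0.03, "fn_risk": 0.005,
    "toxic_prev": 0.03,  # assumed; not specified in the paper's setup
}
civil_monitors = {
    "global": 0.0775, "identity_rate": 0.0325, "identity_score": 0.0313,
    "fn_risk": 0.0123,
    "entropy": 0.0, "toxic_prob": 0.0, "toxic_prev": 0.0,  # placeholders
}
should_adapt, fired = evaluate_trigger(civil_monitors, thresholds)
```

Returning the per-monitor dictionary alongside the boolean keeps the individual failure modes visible, in line with the choice not to collapse the monitors into one score.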
### III-E Hard-Mix Selective Adaptation
When the trigger detects safety-relevant shift, the model is updated using hard-mix selective adaptation. Given an adaptation budget $K$, the method ranks recent examples into four priority groups. Table I summarizes the target allocation and ranking criterion for each component of the hard-mix adaptation set.

TABLE I: Hard-mix adaptation components.

The toxic false-negative-risk component targets toxic examples that the model is likely to miss. The identity-related component focuses this safety objective on identity-targeted harm when the required annotations are available. The false-positive-risk component adds non-toxic examples that the model is likely to over-flag, reducing the chance that adaptation shifts the classifier too aggressively toward toxic predictions. The entropy component adds uncertain boundary cases. If a priority group contains fewer examples than its allocated budget, the remaining slots are filled using the highest-ranked remaining examples by sample-level drift score, followed by random fallback if necessary.
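One way to picture the selection procedure is the sketch below. It is a hedged illustration, not the authors' code: the ranking criteria (lowest $p_{\mathrm{toxic}}$ among toxic examples for false-negative risk, highest $p_{\mathrm{toxic}}$ among non-toxic examples for false-positive risk, highest entropy for boundary cases) are assumptions, and predictive entropy stands in for the sample-level drift score used in the fallback, which is not defined in this section. Because that stand-in is defined for every example, the final random fallback is never reached here. The record fields `id`, `label`, `p_toxic`, and `identity` are hypothetical.

```python
import math

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def hard_mix_select(pool, allocation):
    """Select sum(allocation.values()) examples from a labeled adaptation pool.

    pool: list of dicts with keys id, label (1 toxic / 0 non-toxic), p_toxic,
          and optionally identity (True for identity-related examples).
    allocation: ordered mapping from group name to its target count.
    """
    budget = sum(allocation.values())
    toxic = [e for e in pool if e["label"] == 1]
    non_toxic = [e for e in pool if e["label"] == 0]
    rankings = {  # assumed ranking criteria
        "fn_risk": sorted(toxic, key=lambda e: e["p_toxic"]),
        "identity_fn_risk": sorted(
            (e for e in toxic if e.get("identity")), key=lambda e: e["p_toxic"]),
        "fp_risk": sorted(non_toxic, key=lambda e: -e["p_toxic"]),
        "entropy": sorted(pool, key=lambda e: -binary_entropy(e["p_toxic"])),
    }
    chosen, seen = [], set()

    def take(candidates, quota):
        # Add up to `quota` not-yet-selected examples in ranked order.
        for example in candidates:
            if quota <= 0:
                break
            if example["id"] not in seen:
                seen.add(example["id"])
                chosen.append(example)
                quota -= 1

    for group, quota in allocation.items():
        take(rankings[group], quota)
    # Fallback: fill unmet quotas with the highest-entropy remaining examples.
    take(rankings["entropy"], budget - len(chosen))
    return chosen

toy_pool = [
    {"id": i, "label": i % 2, "p_toxic": (i + 0.5) / 20, "identity": i % 4 == 1}
    for i in range(20)
]
toy_allocation = {"fn_risk": 3, "identity_fn_risk": 2, "fp_risk": 2, "entropy": 1}
selected_ids = [e["id"] for e in hard_mix_select(toy_pool, toy_allocation)]
```

With the allocation reported in Section IV-E, the call would use `{"fn_risk": 280, "identity_fn_risk": 200, "fp_risk": 200, "entropy": 120}`. When no example carries an identity flag, the identity group stays empty and its slots are filled by the fallback step.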
### III-F LoRA-Based Model Updating
Model updates are performed with LoRA rather than full model fine\-tuning\. The base transformer is frozen except for low\-rank trainable adaptation matrices and the classification head\. This makes the update lightweight and suitable for repeated adaptation under deployment constraints\. The same LoRA adaptation procedure is used for both hard\-mix and random\-balanced update baselines; the difference between the methods is the construction of the adaptation set\.
## IV Experimental Setup
### IV-A Datasets and Shift Settings
We evaluate the framework in two settings to demonstrate its flexibility across temporal and cross-dataset moderation shifts. In the Civil Comments temporal-shift setting, the baseline model is trained on earlier Civil Comments data and evaluated on later Civil Comments data. Civil Comments includes identity-related annotations, enabling direct measurement of identity-harm drift through the `identity_attack` signal. The reported Civil experiments use fixed-seed temporal samples with up to 5,000 examples per year-label group and evaluate on a 6,500-example 2017 target split.
In the Jigsaw\-to\-DynaHate cross\-dataset setting, the baseline model is trained on Jigsaw toxic\-comment data and evaluated on DynaHate\. This setting tests a stronger shift because DynaHate contains adversarially collected hate\-speech examples that differ from the source training distribution\. The reported runs use 12,000 Jigsaw training examples, 3,000 Jigsaw holdout examples, 3,000 synthetic drift examples, 3,000 DynaHate adaptation\-pool examples, and 3,000 DynaHate evaluation examples\. DynaHate does not provide the same continuous identity\-attack annotation as Civil Comments, so the identity\-harm monitor is inactive in this setting\.
### IV-B Model and Training Details
All reported LoRA experiments use `distilroberta-base` as the base transformer classifier. LoRA is applied to the `query` and `value` modules with rank $r=8$, $\alpha=16$, dropout $0.05$, and no bias adaptation. The source model is trained for 3 epochs. Drift-triggered adaptation is then performed for 1 epoch on the selected adaptation set. Training uses learning rate $1\times 10^{-4}$, weight decay $0.01$, warmup ratio $0.03$, maximum sequence length 192, batch size 16, gradient accumulation 1, and FP16 precision on GPU.
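To give a sense of how small this update is, the following back-of-the-envelope calculation counts the LoRA parameters implied by these settings. The hidden size (768) and layer count (6) are properties of `distilroberta-base` rather than values stated in the paper, and the count covers only the low-rank matrices, not the classification head that is also trained.

```python
def lora_param_count(rank, d_in, d_out):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

HIDDEN_SIZE = 768                     # distilroberta-base hidden size (model fact)
NUM_LAYERS = 6                        # distilroberta-base transformer layers (model fact)
TARGET_MODULES = ("query", "value")   # from the paper
RANK = 8                              # from the paper

per_module = lora_param_count(RANK, HIDDEN_SIZE, HIDDEN_SIZE)
total_lora = per_module * len(TARGET_MODULES) * NUM_LAYERS
full_matrix = HIDDEN_SIZE * HIDDEN_SIZE
fraction_of_full = per_module / full_matrix

print(per_module, total_lora, round(fraction_of_full, 4))
```

Each adapted projection adds 12,288 trainable parameters, about 2% of the 589,824 weights in the frozen 768 by 768 matrix it modifies, for 147,456 LoRA parameters across the model.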
### IV-C Classification Thresholds
The toxic/non\-toxic decision threshold is not fixed at 0\.5 in the final reported evaluations\. Instead, thresholds are calibrated to satisfy a target toxic recall of 0\.85 when possible\. The threshold search evaluates values from 0\.05 to 0\.95 in increments of 0\.01\. Among thresholds meeting the target recall, the selected threshold maximizes macro F1\. If no threshold reaches the target recall, the selected threshold minimizes the recall shortfall and then maximizes macro F1\. Baseline thresholds are calibrated on the source validation or holdout split\. Adapted\-model thresholds are calibrated on a held\-out threshold\-tuning subset of the adaptation pool\. The same threshold\-calibration procedure is applied to both hard\-mix and random\-balanced adapted models\.
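The calibration rule can be sketched as a lexicographic search over the threshold grid: minimize the recall shortfall first, then maximize macro F1. The `>=` decision rule and the tie-break toward the lowest threshold are assumptions; the paper does not specify them.

```python
def toxic_recall(labels, preds):
    toxic_preds = [p for y, p in zip(labels, preds) if y == 1]
    return sum(toxic_preds) / len(toxic_preds) if toxic_preds else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / 2

def calibrate_threshold(labels, probs, target_recall=0.85):
    """Search thresholds 0.05, 0.06, ..., 0.95 on a calibration split."""
    scored = []
    for i in range(91):
        t = round(0.05 + 0.01 * i, 2)
        preds = [1 if p >= t else 0 for p in probs]
        shortfall = max(0.0, target_recall - toxic_recall(labels, preds))
        # Tuple order encodes the rule: shortfall first, then macro F1,
        # then (assumed) the lowest threshold among exact ties.
        scored.append((shortfall, -macro_f1(labels, preds), t))
    return min(scored)[2]

cal_labels = [1, 1, 1, 1, 0, 0, 0, 0]
cal_probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05]
chosen_threshold = calibrate_threshold(cal_labels, cal_probs)
```

In this toy split, reaching 0.85 recall requires catching the toxic comment scored 0.3, so only thresholds at or below 0.30 qualify. Among those, thresholds from 0.21 to 0.30 admit a single false positive and share the best macro F1, and the tie-break returns 0.21.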
### IV-D Monitoring Windows and Trigger Thresholds
For each experiment, the reference batch is the model’s source\-domain validation or holdout distribution, and the target batch is the shifted evaluation distribution\. The monitoring windows therefore correspond to the same fixed evaluation splits used for reporting drift and model behavior\. For adaptation pools, 50% of the pool is reserved for threshold tuning and the remaining 50% is available for sample selection\.
The global Jensen\-Shannon drift threshold is 0\.30\. The identity\-attack rate threshold is 0\.005 in the reported harm\-aware runs, and the false\-negative\-risk drift threshold is 0\.005\. The default identity\-score, mean\-toxic\-probability, and entropy thresholds are 0\.01, 0\.03, and 0\.03, respectively\. These thresholds are treated as operating points for the experimental protocol rather than universal constants\.
### IV-E Adaptation Budget and Baselines
Each LoRA update uses an adaptation budget of $K=800$ examples. Under hard-mix selection, this corresponds to target allocations of 280 toxic false-negative-risk examples, 200 identity-related toxic false-negative-risk examples, 200 non-toxic false-positive-risk examples, and 120 high-entropy examples. When identity annotations are unavailable, as in DynaHate, the identity-related allocation is filled by fallback ranked examples.
We compare hard\-mix adaptation with two baselines\. The no\-update baseline evaluates the original model on the shifted target data without adaptation\. The random\-balanced update baseline adapts the model using 800 randomly selected recent examples balanced across toxic and non\-toxic labels\. This comparison tests whether the gains come from updating on recent data generally or from risk\-aware sample selection\.
### IV-F Evaluation Metrics and Replication
We evaluate model performance using four primary metrics:
- Macro F1: Measures class-balanced classification quality.
- Toxic recall: Measures label-conditional recall on toxic examples.
- Aggregate false-negative prevalence: Measures the proportion of all evaluation examples that are toxic but predicted as non-toxic.
- Accuracy: Measures overall classification correctness.
All reported LoRA adaptation experiments are repeated over three random seeds: 13, 42, and 101\. For the DynaHate robustness experiment, we additionally compute bootstrap confidence intervals by resampling pooled prediction rows with replacement and recomputing metric deltas between the no\-update and adapted models\.
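A paired bootstrap over prediction rows can be sketched as follows. The percentile interval, the number of resamples (`n_boot=1000`), and the seed are assumptions for illustration; the paper states only that pooled prediction rows are resampled with replacement and the metric deltas recomputed.

```python
import random

def toxic_recall(labels, preds):
    toxic_preds = [p for y, p in zip(labels, preds) if y == 1]
    return sum(toxic_preds) / len(toxic_preds) if toxic_preds else 0.0

def bootstrap_delta_ci(labels, base_preds, adapted_preds, metric,
                       n_boot=1000, alpha=0.05, seed=13):
    """Percentile CI for metric(adapted) - metric(no-update).

    Each resample draws row indices with replacement and applies them to both
    models, so the comparison stays paired on identical examples.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        adapted = metric(y, [adapted_preds[i] for i in idx])
        base = metric(y, [base_preds[i] for i in idx])
        deltas.append(adapted - base)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_boot)]
    upper = deltas[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data: 100 toxic and 100 non-toxic rows; the no-update model misses
# 30 toxic rows and the adapted model misses 10 of those same rows.
toy_labels = [1] * 100 + [0] * 100
toy_base = [0] * 30 + [1] * 70 + [0] * 100
toy_adapted = [0] * 10 + [1] * 90 + [0] * 100
ci_low, ci_high = bootstrap_delta_ci(toy_labels, toy_base, toy_adapted, toxic_recall)
```

An interval that excludes zero, as in this toy case where the point estimate of the recall gain is 0.20, is the kind of evidence the robustness analysis relies on. The same function works for other metrics by passing a different `metric` callable.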
## V Experiments
This section reports the experimental results for multi\-monitor drift detection and selective adaptation on Civil Comments and DynaHate\. Reported values are averaged across three random seeds\. Adaptation methods are compared under the same adaptation budget and evaluated on the same target splits\.
### V\-A Civil Comments Results
#### V\-A1 Drift Monitor Outcomes
In the Civil Comments temporal experiment, global JS drift does not trigger under the original global threshold, but identity\-harm drift does\.
TABLE II: Civil Comments drift monitor outcomes\.

As shown in Table [II](https://arxiv.org/html/2606.28725#S5.T2), the Civil Comments result shows a targeted safety shift rather than broad distributional drift\. Global JS drift remains low \(0\.0775\) and does not trigger adaptation, while both identity\-harm monitors trigger: identity attack rate increases by 0\.0325 and mean identity attack score increases by 0\.0313\. In addition, false\-negative\-risk drift increases by 0\.0123, indicating that the model is more likely to miss toxic examples under the shifted distribution\. By contrast, entropy, mean toxic probability, and predicted toxic prevalence remain nearly unchanged\. This pattern supports the central claim that global distribution drift alone is insufficient for adaptive moderation: safety\-relevant drift can emerge in targeted harm subspaces even when the overall input distribution appears stable\.
#### V\-A2 Adaptation Performance
As shown in Table [III](https://arxiv.org/html/2606.28725#S5.T3), hard\-mix updating provides the strongest overall adaptation outcome\. Compared with random\-balanced updating, hard\-mix raises toxic recall from 0\.8501 to 0\.8777, a gain of 0\.0276\. This indicates that hard\-mix is more effective at recovering the model’s ability to detect toxic content, which is the primary safety objective in moderation\. Hard\-mix also achieves higher overall accuracy, improving from 0\.8238 under random\-balanced updating to 0\.8334\. Although random\-balanced updating reports a lower aggregate false\-negative prevalence, hard\-mix provides the better tradeoff between safety recall and overall predictive performance\. These results suggest that selecting high\-risk and high\-uncertainty examples is more effective than updating on a general balanced sample when the goal is to preserve moderation safety under drift\.
TABLE III: Civil Comments update strategy comparison\.
### V\-B DynaHate Results
#### V\-B1 Drift Monitor Outcomes
Compared with Civil Comments, DynaHate exhibits a larger cross\-domain distributional shift\. The global JS drift monitor is triggered, and the model\-risk monitors show substantial increases in uncertainty, predicted toxicity, and false\-negative risk\.
TABLE IV: DynaHate drift monitor outcomes\.

As shown in Table [IV](https://arxiv.org/html/2606.28725#S5.T4), the DynaHate results show a substantial cross\-domain shift relative to the Jigsaw training distribution\. Global JS drift triggers at 0\.2796, and all model\-risk monitors also increase: entropy \(\+0\.2188\), mean toxic probability \(\+0\.1409\), predicted toxic prevalence \(\+0\.1653\), and false\-negative\-risk \(\+0\.1311\)\. These results indicate that the DynaHate shift affects both the input distribution and the model’s safety\-relevant behavior, especially its risk of missing toxic content\.
#### V\-B2 Adaptation Performance
Hard\-mix adaptation substantially improves toxic recall and reduces false\-negative prevalence on DynaHate\.
TABLE V: DynaHate update strategy comparison\.

As shown in Table [V](https://arxiv.org/html/2606.28725#S5.T5), the DynaHate strategy comparison shows that both adaptation methods improve safety\-relevant performance relative to the no\-update baseline\. Random\-balanced updating increases toxic recall from 0\.7107 to 0\.8171 and reduces the false\-negative prevalence from 0\.1594 to 0\.0787\. Hard\-mix updating produces the strongest overall result, achieving the highest macro F1 \(0\.5343\), toxic recall \(0\.8523\), and accuracy \(0\.6010\)\. Although random\-balanced updating has a slightly lower false\-negative prevalence, hard\-mix provides the best balance between toxic\-content detection and overall predictive performance\. This suggests that risk\-aware sample selection is more effective than general balanced updating under the stronger DynaHate cross\-domain shift\.
#### V\-B3 Bootstrap Confidence Intervals
The bootstrap analysis evaluates whether the observed DynaHate improvements are stable under resampling of the evaluation predictions\. We repeatedly resample the pooled predictions with replacement and recompute the metric deltas between the no\-update and adapted models\. As shown in Table [VI](https://arxiv.org/html/2606.28725#S5.T6), the resulting confidence intervals show that the safety gains are robust: toxic recall increases by 0\.1418 with a 95% CI of \[0\.1305, 0\.1531\], and false\-negative prevalence decreases by 0\.0781 with a 95% CI of \[\-0\.0846, \-0\.0717\]\. By contrast, macro F1 changes only slightly and its interval includes zero, suggesting that adaptation mainly improves the safety\-critical ability to detect toxic content rather than broadly improving all classification metrics\.
TABLE VI: Bootstrap confidence intervals for DynaHate metric changes\.
### V\-C Cross\-Experiment Findings
The experiments support two main findings\.
Finding 1: Multiple monitors detect safety\-relevant drift that global drift alone does not capture\. The Civil Comments result shows a targeted harm shift: global JS drift remains low at 0\.0775 and does not trigger, while identity attack rate drift \(\+0\.0325\), identity attack score drift \(\+0\.0313\), and false\-negative\-risk drift \(\+0\.0123\) all trigger\. This pattern indicates that harmful behavior can change in a safety\-critical subspace even when the overall text distribution appears stable\. In contrast, DynaHate shows a broader cross\-domain shift\. Global JS drift reaches 0\.2796, and the model\-risk monitors increase substantially in entropy \(\+0\.2188\), mean toxic probability \(\+0\.1409\), predicted toxic prevalence \(\+0\.1653\), and false\-negative risk \(\+0\.1311\)\. Together, the two settings show why adaptive moderation should monitor both distributional drift and safety\-specific model\-risk signals\. Civil Comments demonstrates that global drift can miss targeted harm\-subspace changes, while DynaHate demonstrates that stronger cross\-domain shift can affect both input distributions and model behavior\.
Finding 2: Risk\-aware hard\-mix updating provides a stronger adaptation tradeoff than general balanced updating\. After drift is detected, updating the model with carefully selected samples improves safety\-oriented performance\. Hard\-mix updating raises toxic recall to 0\.8777 and accuracy to 0\.8334 on Civil Comments, and raises toxic recall from 0\.7107 to 0\.8523 and accuracy from 0\.5568 to 0\.6010 on DynaHate\. The DynaHate bootstrap analysis confirms that these safety gains are stable: toxic recall increases by 0\.1418 with a 95% confidence interval of \[0\.1305, 0\.1531\], and false\-negative prevalence decreases by 0\.0781 with a 95% confidence interval of \[\-0\.0846, \-0\.0717\]\.
The comparison with random\-balanced updating shows why sample selection matters\. On Civil Comments, hard\-mix achieves higher toxic recall than random\-balanced updating \(0\.8777 vs\. 0\.8501\) and higher accuracy \(0\.8334 vs\. 0\.8238\)\. On DynaHate, hard\-mix also outperforms random\-balanced updating on toxic recall \(0\.8523 vs\. 0\.8171\), macro F1 \(0\.5343 vs\. 0\.5255\), and accuracy \(0\.6010 vs\. 0\.5864\)\. Although random\-balanced updating remains competitive on aggregate false\-negative prevalence, hard\-mix better matches the paper’s adaptation objective: selecting high\-risk and high\-uncertainty samples to recover safety recall while preserving overall predictive performance\.
## VI Conclusion
This paper evaluated whether adaptive moderation can be improved by connecting drift detection to safety\-aware model updating\. The results show that drift in moderation is not always visible through a single global distribution metric\. In the Civil Comments temporal setting, safety\-relevant identity\-harm signals changed even when global drift did not trigger\. In the DynaHate transfer setting, both global drift and model\-risk monitors increased, indicating a broader and more severe shift\. Together, these findings support the use of multiple monitors that separately track distributional change, harm\-subspace change, uncertainty, toxic\-risk, and false\-negative risk\.
The adaptation results further show that the choice of update data matters\. Updating the model with hard\-mix examples improved toxic recall and reduced missed toxic content, especially under the stronger DynaHate shift\. Compared with random\-balanced updating, hard\-mix adaptation produced stronger toxic\-recall and accuracy tradeoffs across the reported experiments\. This suggests that adaptive moderation should not only decide when to update, but also select examples according to the failure modes that triggered the update\.
This study also has limitations that suggest directions for future work\. Some harm\-aware monitors depend on dataset\-specific annotations, such as Civil Comments identity\-attack scores, and should be extended with annotation\-independent harm\-subspace monitors\. The trigger thresholds are operational choices rather than universal constants, so future work should study calibration through validation data, risk\-cost curves, or conformal monitoring\. Finally, the experiments are conducted in offline benchmark settings; real\-world moderation systems may involve delayed labels, adversarial behavior, policy changes, and human\-review costs\. Future work should evaluate DriftGuard in online or human\-in\-the\-loop moderation workflows, including carefully validated model\-assisted review mechanisms for prioritizing uncertain cases and generating weak safety signals\.
Overall, the study supports a safety\-aware view of moderation drift: model updating should be triggered by signals that reflect emerging harm and model risk, not only by aggregate distribution shift\. A multi\-monitor trigger paired with selective hard\-mix adaptation provides a practical step toward moderation systems that can respond more directly to evolving harmful behavior while limiting unnecessary or poorly targeted retraining\.
## References
- \[1\] \(2026\) A comprehensive analysis of indicator effect on llm performance\. Available: https://doi\.org/10\.13140/RG\.2\.2\.29554\.98248 Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p3.1)\.
- \[2\] A\. Ainiwaer, Q\. Liu, and M\. Lily \(2026\) Study of examples effect on the llm performance\. Available: https://doi\.org/10\.13140/RG\.2\.2\.20382\.40007 Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[3\] F\. Bayram, B\. S\. Ahmed, and A\. Kassler \(2022\) From concept drift to model degradation: an overview on performance\-aware drift detectors\. Knowledge\-Based Systems 245, pp\. 108632\. Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p1.1)\.
- \[4\] D\. Borkan, L\. Dixon, J\. Sorensen, N\. Thain, and L\. Vasserman \(2019\) Nuanced metrics for measuring unintended bias with real data for text classification\. In Companion proceedings of the 2019 world wide web conference, pp\. 491–500\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1), [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p2.1)\.
- \[5\] Y\. Chen, P\. Qian, S\. Wang, S\. Zhang, H\. Xu, S\. Lin, and X\. Wei \(2026\) Does rag know when retrieval is wrong? diagnosing context compliance under knowledge conflict\. External Links: 2605\.14473, [Link](https://arxiv.org/abs/2605.14473) Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p3.1)\.
- \[6\] Y\. Chu, X\. Ma, X\. Jin, G\. Luo, and X\. Gao \(2026\) MedTri: a platform for structured medical report normalization to enhance vision\-language pretraining\. arXiv preprint arXiv:2602\.22143\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[7\] Y\. Dang, Z\. Pan, X\. Zhang, W\. Chen, F\. Cai, and H\. Chen \(2025\) Discrepancy learning guided hierarchical fusion network for multi\-modal recommendation\. Knowledge\-Based Systems 317, pp\. 113496\. Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p3.1)\.
- \[8\] T\. Davidson, D\. Bhattacharya, and I\. Weber \(2019\) Racial bias in hate speech and abusive language detection datasets\. In Proceedings of the third workshop on abusive language online, pp\. 25–35\. Cited by: [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p2.1)\.
- \[9\] B\. Eck, D\. Kabakci\-Zorlu, Y\. Chen, F\. Savard, and X\. Bao \(2022\) A monitoring framework for deployed machine learning models with supply chain examples\. In 2022 IEEE International Conference on Big Data \(Big Data\), pp\. 2231–2238\. Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p3.1)\.
- \[10\] J\. Gama, I\. Žliobaitė, A\. Bifet, M\. Pechenizkiy, and A\. Bouchachia \(2014\) A survey on concept drift adaptation\. ACM computing surveys \(CSUR\) 46\(4\), pp\. 1–37\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p1.1)\.
- \[11\] Z\. Han, C\. Gao, J\. Liu, J\. Zhang, and S\. Q\. Zhang \(2024\) Parameter\-efficient fine\-tuning for large models: a comprehensive survey\. arXiv preprint arXiv:2403\.14608\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[12\] E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen, et al\. \(2022\) LoRA: low\-rank adaptation of large language models\. ICLR 1\(2\), pp\. 3\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[13\] M\. S\. Jahan and M\. Oussalah \(2023\) A systematic review of hate speech automatic detection using natural language processing\. Neurocomputing 546, pp\. 126232\. Cited by: [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p1.1)\.
- \[14\] Y\. Jiang and F\. Ferraro \(2026\) Beyond math: stories as a testbed for memorization\-constrained reasoning in llms\. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 5590–5607\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1)\.
- \[15\] Y\. Jiang, D\. Li, and F\. Ferraro \(2026\) DRP: distilled reasoning pruning with skill\-aware step decomposition for efficient large reasoning models\. External Links: 2505\.13975, [Link](https://arxiv.org/abs/2505.13975) Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[16\] P\. W\. Koh, S\. Sagawa, H\. Marklund, S\. M\. Xie, M\. Zhang, A\. Balsubramani, W\. Hu, M\. Yasunaga, R\. L\. Phillips, I\. Gao, et al\. \(2021\) Wilds: a benchmark of in\-the\-wild distribution shifts\. In International conference on machine learning, pp\. 5637–5664\. Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p2.1)\.
- \[17\] D\. Li, Z\. Wang, Y\. Chen, R\. Jiang, W\. Ding, and M\. Okumura \(2024\) A survey on deep active learning: recent advances and new frontiers\. IEEE Transactions on Neural Networks and Learning Systems 36\(4\), pp\. 5879–5899\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p1.1)\.
- \[18\] L\. Lin, J\. You, Y\. Li, L\. Lin, Y\. Wang, Z\. Zhang, and M\. Zheng \(2026\) Reflect\-guard: enhancing llm safeguards against adversarial prompts via logical self\-reflection\. arXiv preprint arXiv:2605\.24834\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2605.24834), [Link](https://arxiv.org/abs/2605.24834) Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p1.1)\.
- \[19\] L\. Lin and Y\. Wang \(2025\) SHAP stability in credit risk management: a case study in credit card default model\. Risks 13\(12\), pp\. 238\. External Links: [Document](https://dx.doi.org/10.3390/risks13120238), [Link](https://doi.org/10.3390/risks13120238) Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p1.1)\.
- \[20\] T\. Lin, P\. Goyal, R\. Girshick, K\. He, and P\. Dollár \(2017\) Focal loss for dense object detection\. In Proceedings of the IEEE international conference on computer vision, pp\. 2980–2988\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p1.1)\.
- \[21\] Z\. Lipton, Y\. Wang, and A\. Smola \(2018\) Detecting and correcting for label shift with black box predictors\. In International conference on machine learning, pp\. 3122–3130\. Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p2.1)\.
- \[22\] B\. Mathew, P\. Saha, S\. M\. Yimam, C\. Biemann, P\. Goyal, and A\. Mukherjee \(2021\) Hatexplain: a benchmark dataset for explainable hate speech detection\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 35, pp\. 14867–14875\. Cited by: [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p3.1)\.
- \[23\] J\. Pavlopoulos, J\. Sorensen, L\. Dixon, N\. Thain, and I\. Androutsopoulos \(2020\) Toxicity detection: does context really matter? In Proceedings of the 58th annual meeting of the association for computational linguistics, pp\. 4296–4305\. Cited by: [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p1.1)\.
- \[24\] F\. M\. Polo, R\. Izbicki, E\. G\. Lacerda Jr, J\. P\. Ibieta\-Jimenez, and R\. Vicente \(2023\) A unified framework for dataset shift diagnostics\. Information Sciences 649, pp\. 119612\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1), [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p1.1)\.
- \[25\] P\. Qian, S\. Wang, X\. Wang, Y\. Chen, W\. Xu, Q\. Yu, S\. Lin, S\. Zhang, J\. You, and X\. Wei \(2026\) Relevant is not warranted: evidence\-force calibration for cited rag\. External Links: 2605\.28044, [Link](https://arxiv.org/abs/2605.28044) Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1), [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p3.1)\.
- \[26\] S\. Rabanser, S\. Günnemann, and Z\. Lipton \(2019\) Failing loudly: an empirical study of methods for detecting dataset shift\. Advances in Neural Information Processing Systems 32\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p1.1), [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p2.1)\.
- \[27\] P\. Röttger, B\. Vidgen, D\. Nguyen, Z\. Talat, H\. Margetts, and J\. Pierrehumbert \(2021\) HateCheck: functional tests for hate speech detection models\. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\), pp\. 41–58\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1), [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p3.1)\.
- \[28\] S\. Salarian, Y\. Zhang, S\. Padhee, and S\. Parthasarathy \(2025\) MedEqualizer: a framework investigating bias in synthetic medical data and mitigation via augmentation\. arXiv preprint arXiv:2511\.01054\. Cited by: [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p2.1)\.
- \[29\] M\. Sap, D\. Card, S\. Gabriel, Y\. Choi, and N\. A\. Smith \(2019\) The risk of racial bias in hate speech detection\. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp\. 1668–1678\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1), [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p2.1)\.
- \[30\] B\. Settles \(2009\) Active learning literature survey\. Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p3.1)\.
- \[31\] A\. Shrivastava, A\. Gupta, and R\. Girshick \(2016\) Training region\-based object detectors with online hard example mining\. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp\. 761–769\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p1.1)\.
- \[32\] B\. Vidgen, T\. Thrush, Z\. Talat, and D\. Kiela \(2021\) Learning from the worst: dynamically generated datasets to improve online hate detection\. In Proceedings of the 59th annual meeting of the Association for Computational Linguistics and the 11th international joint conference on natural language processing \(volume 1: long papers\), pp\. 1667–1682\. Cited by: [§II\-B](https://arxiv.org/html/2606.28725#S2.SS2.p3.1)\.
- \[33\] L\. Wang, S\. Chen, L\. Jiang, S\. Pan, R\. Cai, S\. Yang, and F\. Yang \(2024\) Parameter\-efficient fine\-tuning in large models: a survey of methodologies\. arXiv preprint arXiv:2410\.19878\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[34\] S\. Wang, P\. Qian, Y\. Chen, J\. You, X\. Wang, X\. Jiang, L\. Liu, H\. Yu, and J\. Xu \(2026\) When safe skills collide: measuring compositional risk in agent skill ecosystems\. External Links: 2606\.00448, [Link](https://arxiv.org/abs/2606.00448) Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1)\.
- \[35\] Y\. Wang, X\. Li, R\. Wu, H\. Chen, and D\. Kutscher \(2025\) NetSenseML: network\-adaptive compression for efficient distributed machine learning\. In Euro\-Par 2025: Parallel Processing: 31st European Conference on Parallel and Distributed Processing, Dresden, Germany, August 25–29, 2025, Proceedings, Part III, Berlin, Heidelberg, pp\. 283–297\. External Links: ISBN 978\-3\-031\-99871\-3, [Link](https://doi.org/10.1007/978-3-031-99872-0_20), [Document](https://dx.doi.org/10.1007/978-3-031-99872-0%5F20) Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.
- \[36\] Y\. Xu, H\. Sang, Z\. Zhou, R\. He, Z\. Wang, and A\. Geramifard \(2026\) TIP: token importance in on\-policy distillation\. External Links: 2604\.14084, [Link](https://arxiv.org/abs/2604.14084) Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p1.1)\.
- \[37\] Y\. Xu, H\. Sang, Z\. Zhou, R\. He, and Z\. Wang \(2026\) PACED: distillation and on\-policy self\-distillation at the frontier of student competence\. External Links: 2603\.11178, [Link](https://arxiv.org/abs/2603.11178) Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p1.1)\.
- \[38\] J\. Yao, Z\. Zheng, and B\. Li \(2026\) Measuring whether llm tutors teach or solve: a diagnostic for educational impact\. External Links: 2606\.16206, [Link](https://arxiv.org/abs/2606.16206) Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p2.1)\.
- \[39\] J\. Yao, Z\. Zheng, and J\. Long \(2026\) Ranking abuse via strategic pairwise data perturbations\. External Links: 2604\.17805, [Link](https://arxiv.org/abs/2604.17805) Cited by: [§I](https://arxiv.org/html/2606.28725#S1.p1.1)\.
- \[40\] Z\. Zhang, E\. Strubell, and E\. Hovy \(2022\) A survey of active learning for natural language processing\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp\. 6166–6190\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p1.1)\.
- \[41\] Z\. Zhang, R\. Fu, Y\. He, X\. Shen, Y\. Wang, X\. Du, H\. You, K\. Jin, J\. Shi, and S\. Fong \(2026\) FinSentLLM: multi\-llm and structured semantic signals for enhanced financial sentiment forecasting\. In ICASSP 2026\-2026 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\), pp\. 17682–17686\. Cited by: [§II\-A](https://arxiv.org/html/2606.28725#S2.SS1.p3.1)\.
- \[42\] S\. Zhou, S\. Wang, Z\. Yuan, M\. Shi, Y\. Shang, and D\. Yang \(2025\) GSQ\-tuning: group\-shared exponents integer in fully quantized training for llms on\-device fine\-tuning\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 22971–22988\. Cited by: [§II\-C](https://arxiv.org/html/2606.28725#S2.SS3.p2.1)\.