Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

arXiv cs.CL Papers

Summary

This replication study evaluates DExperts for mitigating toxicity in LLMs, finding near-perfect safety against explicit toxicity but reduced effectiveness against implicit hate speech and a significant latency trade-off.

arXiv:2605.14087v1 Announce Type: new Abstract: Large Language Models (LLMs), when trained on web-scale corpora, inherently absorb toxic patterns from their training data. This leads to ``toxic degeneration'' where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments. Thus, necessitating effective mitigation strategies that should maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of \textbf{DExperts} (Decoding-time Experts), which is an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using \textbf{RealToxicityPrompts} on standard GPT-2 models; then (2) implementing and evaluating DExperts to mitigate explicit toxicity; and finally (3) stress-testing the method against implicit hate speech using the adversarial \textbf{ToxiGen} dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100\%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5\%. Furthermore, we quantify a critical trade-off. The method introduces a $\sim$10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.
Original Article
View Cached Full Text

Cached at: 05/15/26, 06:19 AM

# Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study
Source: [https://arxiv.org/html/2605.14087](https://arxiv.org/html/2605.14087)
,Archit RathodUniversity of Illinois ChicagoChicagoIllinoisUSA[arath21@uic\.edu](https://arxiv.org/html/2605.14087v1/mailto:[email protected])andAkshaj Kurra SatishkumarUniversity of Illinois ChicagoChicagoIllinoisUSA[akurr@uic\.edu](https://arxiv.org/html/2605.14087v1/mailto:[email protected])

\(2018\)

###### Abstract\.

Large Language Models \(LLMs\), when trained on web\-scale corpora, inherently absorb toxic patterns from their training data\. This leads to “toxic degeneration” where even innocuous prompts can trigger harmful outputs\. This phenomenon poses significant risks for real\-world deployments\. Thus, necessitating effective mitigation strategies that should maintain model utility while ensuring safety\. In this comprehensive replication study, we evaluate the efficacy ofDExperts\(Decoding\-time Experts\), which is an inference\-time mitigation technique that steers generation without requiring model retraining\. We structured our research into three systematic phases: \(1\) establishing baseline toxicity measurements usingRealToxicityPromptson standard GPT\-2 models; then \(2\) implementing and evaluating DExperts to mitigate explicit toxicity; and finally \(3\) stress\-testing the method against implicit hate speech using the adversarialToxiGendataset\. Our empirical results confirm that while DExperts achieves near\-perfect safety rates \(100%\) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98\.5%\. Furthermore, we quantify a critical trade\-off\. The method introduces a∼\\sim10x latency penalty \(from 0\.2s to 2\.0s per generation\), posing challenges for real\-time deployment scenarios\. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation\. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs\.

††copyright:acmlicensedDeclaration of AI Usage:No AI tools were used in the generation of this report\.

## 1\.Introduction

### 1\.1\.Motivation and Problem Statement

Large Language Models \(LLMs\) such as GPT\-2, GPT\-3, and their successors have demonstrated remarkable capabilities in natural language understanding and generation tasks\([brown2020language,](https://arxiv.org/html/2605.14087#bib.bib1);[radford2019language,](https://arxiv.org/html/2605.14087#bib.bib2)\)\. These models, trained on vast datasets scraped from the internet, exhibit unprecedented fluency and coherence across diverse domains\. However, this same web\-scale training paradigm introduces a critical vulnerability\. The models absorb and reproduce the biases, hate speech, stereotypes, and toxic patterns prevalent in their training corpora\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3);[bender2021dangers,](https://arxiv.org/html/2605.14087#bib.bib4)\)\.

This phenomenon, termedtoxic degeneration, manifests when models generate harmful content even from seemingly innocuous or neutral prompts\. Recent research has demonstrated that this problem extends beyond explicit toxicity to more sophisticated forms\. Zeng et al\.\([zeng2025metaphorical,](https://arxiv.org/html/2605.14087#bib.bib18)\)showed that even state\-of\-the\-art models like GPT\-4o frequently misinterpret metaphorical implicit hate speech, where harmful stereotypes are disguised as seemingly innocuous expressions through rhetorical devices\. For instance, a benign prompt such as “The men started to” might complete as “…fight and kill each other,” or worse, escalate to explicit hate speech targeting specific demographic groups\. Such behaviors pose substantial risks for deploying LLMs in user\-facing applications, including chatbots, content generation tools, and automated writing assistants\. These applications, when exposed to toxic outputs, can cause real harm to users and communities\.

The challenge extends beyond mere detection\. Traditional content moderation approaches, such as keyword\-based filtering and blocklists, suffer from fundamental limitations: they are context\-blind \(e\.g\., blocking “kill” prevents discussing “killing cancer cells” in medical contexts\), easily circumvented through lexical variations, and significantly reduce model utility by over\-censoring legitimate content\. More sophisticated approaches, involving model retraining or fine\-tuning on curated datasets, incur prohibitive computational costs\. This often requires millions of GPU hours and a substantial environmental impact, while still not guaranteeing safety against adversarial inputs\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\)\.

### 1\.2\.Research Gap and Novelty

While existing research has established that LLMs can generate toxic content\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3)\)and proposed various mitigation strategies\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6);[welbl2021challenges,](https://arxiv.org/html/2605.14087#bib.bib7)\), a critical gap persists in understanding how these mitigation techniques perform against diverse forms of toxicity\. Specifically, most safety evaluations focus onexplicit toxicity\. We noticed overt slurs, threats, and profanity that are relatively straightforward to detect\. However, real\-world hate speech frequently manifests asimplicit toxicity: subtle stereotypes, coded language, microaggressions, and statements framed as “polite” opinions that perpetuate harmful biases while evading simple detection mechanisms\.

Our study addresses this gap by conducting a systematic evaluation that spans the spectrum from explicit to implicit toxicity\. Our novel contribution lies in the comprehensive stress\-testing of DExperts against adversarially generated implicit hate speech, revealing fundamental robustness limitations not captured by standard benchmarks\. We quantify the “robustness gap” as the performance degradation when transitioning from explicit to implicit toxicity detection and mitigation\.

### 1\.3\.Research Questions

Our investigation is guided by three primary research questions:

1. \(1\)RQ1 \(Baseline Measurement\):To what extent does a standard, unmitigated pretrained LLM \(GPT\-2\) generate toxic content from non\-toxic prompts? What is the distribution and severity of toxic outputs?
2. \(2\)RQ2 \(Mitigation Efficacy and Trade\-offs\):Can inference\-time control methods \(specifically DExperts\) significantly reduce toxicity without compromising generation quality? What are the computational costs associated with this mitigation?
3. \(3\)RQ3 \(Robustness and Generalization\):Does the mitigation technique generalize effectively to implicit, adversarial hate speech? What is the robustness gap between explicit and implicit toxicity mitigation?

### 1\.4\.Contributions

This work makes the following key contributions:

- •Comprehensive Baseline Analysis:We provide detailed quantitative analysis of baseline GPT\-2 toxicity, revealing that approximately 4\.2% of generations from non\-toxic prompts fall into the “danger zone” \(toxicity score ¿ 0\.5\)\.
- •Mitigation Validation:We successfully replicate and validate the DExperts method, confirming 100% safety rates on standard RealToxicityPrompts benchmarks, representing a complete elimination of the baseline failure rate\.
- •Robustness Gap Identification:We identify and quantify a significant robustness gap: while DExperts performs perfectly on explicit toxicity, safety rates drop to 98\.5% on implicit, adversarial hate speech from ToxiGen, indicating brittleness in generalization\.
- •Cost\-Benefit Analysis:We provide detailed measurements of the computational overhead introduced by DExperts, documenting a 10x increase in inference latency \(from 0\.2s to 2\.0s per generation\), which has important implications for real\-time deployment scenarios\.
- •Methodology Framework:We establish a systematic three\-phase evaluation framework \(Baseline, Mitigation, Adversarial\) that can serve as a template for future work in toxicity mitigation research\.

## 2\.Related Work and Literature Review

### 2\.1\.Toxicity in Language Models

The problem of bias and toxicity in language models has been extensively documented in recent surveys\. Gallegos et al\.\([gallegos2024bias,](https://arxiv.org/html/2605.14087#bib.bib19)\)provide a comprehensive taxonomy of bias evaluation and mitigation techniques, categorizing approaches by intervention stage: pre\-processing \(modifying inputs\), in\-training \(modifying optimization\), intra\-processing \(modifying inference behavior\), and post\-processing \(modifying outputs\)\. This framework helps contextualize the various mitigation strategies we discuss in this section\.

The problem of toxic content generation in neural language models has been extensively documented\. Gehman et al\.\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3)\)introduced the RealToxicityPrompts dataset and demonstrated that even large\-scale models like GPT\-3 exhibit toxic degeneration, generating unsafe content with non\-negligible probability even from seemingly innocuous prompts\. Their work established the Expected Maximum Toxicity metric and showed that larger models do not necessarily generate less toxic content, challenging assumptions about scale improving safety\.

Bender et al\.\([bender2021dangers,](https://arxiv.org/html/2605.14087#bib.bib4)\)provided a broader critique of large language models, documenting their tendency to perpetuate stereotypes and biases from training data\. They highlighted environmental costs and the risks of deploying models trained on unfiltered internet text\. Sheng et al\.\([sheng2019woman,](https://arxiv.org/html/2605.14087#bib.bib5)\)demonstrated systematic gender bias in language generation, showing models tend to associate certain demographics with negative attributes\. These foundational works establish the pervasiveness of the toxicity problem across model architectures and scales\.

### 2\.2\.Mitigation Approaches

Various mitigation strategies have been proposed, which can be broadly categorized into three approaches:

Data Filtering and Curation:Welbl et al\.\([welbl2021challenges,](https://arxiv.org/html/2605.14087#bib.bib7)\)explored training models on filtered datasets, removing toxic content before model training\. While this reduces baseline toxicity, it requires expensive retraining, may reduce model capabilities on certain tasks, and does not eliminate all toxic outputs\.

Fine\-tuning and RLHF:Recent work has explored Reinforcement Learning from Human Feedback \(RLHF\) to align model outputs with human preferences\([ouyang2022training,](https://arxiv.org/html/2605.14087#bib.bib8)\)\. While effective, this approach requires substantial human annotation, is computationally expensive, and can introduce new biases based on annotator preferences\.

An important variant of RLHF is Constitutional AI \(CAI\), proposed by Bai et al\.\([bai2022constitutional,](https://arxiv.org/html/2605.14087#bib.bib24)\)\. Rather than relying on human feedback for every specific output, CAI embeds a predefined set of rules or ”constitution” directly into the training process\. The model learns to critique and revise its own behavior through two phases: a supervised learning phase involving self\-critiques and revisions, followed by reinforcement learning from AI feedback \(RLAIF\) rather than human feedback\. This approach reduces the human annotation burden while maintaining alignment with safety principles, representing a promising alternative to traditional RLHF for toxicity mitigation\.

Inference\-time Control:Liu et al\.\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\)proposed DExperts, which we replicate in this study\. Their method manipulates decoding probabilities using expert and anti\-expert models, avoiding the need for retraining the base model\. PPLM\([dathathri2019plug,](https://arxiv.org/html/2605.14087#bib.bib9)\)and FUDGE\([yang2021fudge,](https://arxiv.org/html/2605.14087#bib.bib10)\)represent alternative inference\-time approaches using different control mechanisms\. Our work extends the DExperts evaluation to adversarial scenarios not covered in the original paper\.

Alternative inference\-time approaches have also been proposed to address toxicity mitigation\. Gururangan et al\.\([suau2024whispering,](https://arxiv.org/html/2605.14087#bib.bib14)\)introduced AUROC adaptation \(AurA\), which identifies neurons responsible for toxicity based on their discriminative power and reduces their activation levels proportionally, achieving up to 2\.2× reduction in toxicity with only 0\.72 perplexity increase\. Unlike DExperts, which combines expert and anti\-expert models, AurA operates through direct neural intervention at the neuron level\.

Lee et al\.\([kim2023gta,](https://arxiv.org/html/2605.14087#bib.bib15)\)proposed Gated Toxicity Avoidance \(GTA\), which specifically addresses the performance preservation challenge during toxicity mitigation\. Their method maintains grammar, topic consistency, and perplexity while reducing toxicity, directly tackling the quality\-safety trade\-off that we observe in our DExperts evaluation\.

More broadly, Liang et al\.\([liang2024controllable,](https://arxiv.org/html/2605.14087#bib.bib23)\)provide a comprehensive survey of controllable text generation methods for LLMs, categorizing approaches into model retraining, fine\-tuning, reinforcement learning, prompt engineering, latent space manipulation, and decoding\-time intervention\. This taxonomy helps position DExperts within the broader landscape of controllable generation techniques\.

### 2\.3\.Knowledge Editing Approaches

A fundamentally different approach to toxicity mitigation involves directly editing model parameters to remove toxic knowledge, rather than suppressing it at inference time\. Wang et al\.\([wang2024detoxifying,](https://arxiv.org/html/2605.14087#bib.bib21)\)introduced Detoxifying with Intraoperative Neural Monitoring \(DINM\), which diminishes the toxicity of parameters within a few tuning steps via only one instance\. Their SafeEdit benchmark covers nine unsafe categories with various attack prompts and comprehensive evaluation metrics\.

Critically, their analysis demonstrates that methods like supervised fine\-tuning \(SFT\) and DPO may merely suppress the activations of toxic parameters, while DINM mitigates toxicity to a greater extent through permanent parameter adjustments\. This distinction is important: inference\-time methods like DExperts modify outputs during generation, while knowledge editing approaches like DINM make permanent changes to the model’s internal representations\. However, knowledge editing approaches require careful validation to ensure they do not harm model capabilities on benign tasks\.

### 2\.4\.Adversarial and Implicit Hate Speech

Hartvigsen et al\.\([hartvigsen2022toxigen,](https://arxiv.org/html/2605.14087#bib.bib11)\)introduced ToxiGen, a dataset of adversarially generated implicit hate speech targeting specific demographic groups\. They demonstrated that standard toxicity classifiers struggle with implicit hate, achieving lower performance compared to explicit hate detection\. This dataset enables systematic evaluation of model robustness against coded and subtle toxicity\.

Recent work has further explored the challenges of implicit toxicity detection and generation\. Sheng et al\.\([wen2023unveiling,](https://arxiv.org/html/2605.14087#bib.bib16)\)demonstrated that LLMs can generate diverse implicit toxic outputs through reinforcement learning\-based methods that specifically evade standard toxicity classifiers\. Their work employs an adversarial approach where models are explicitly rewarded for generating content that is harmful yet classified as non\-toxic by existing detectors, revealing fundamental limitations in current detection systems\.

Vidgen et al\.\([roy2023probing,](https://arxiv.org/html/2605.14087#bib.bib17)\)provided a detailed error typology showing where LLMs fail in hate speech detection, particularly on implicit hate from the ToxiGen dataset\. They evaluated LLMs including text\-davinci and Flan\-T5 on HateXplain, implicit hate, and ToxicSpans datasets, finding that including target information in the pipeline improves model performance substantially \(approximately 20\-30

Most recently, Zeng et al\.\([zeng2025metaphorical,](https://arxiv.org/html/2605.14087#bib.bib18)\)revealed that even advanced models like GPT\-4o frequently misinterpret metaphorical implicit hate speech, where hateful intent is disguised through rhetorical devices\. They employed jailbreaking strategies and energy\-based constrained decoding techniques, demonstrating that specialized safety models like ShieldGemma and LlamaGuard inadequately block such content, often misclassifying it as harmless\. This represents the cutting edge of adversarial implicit toxicity and directly motivates our stress\-testing approach in Phase 3\.

Sap et al\.\([sap2019risk,](https://arxiv.org/html/2605.14087#bib.bib12)\)explored social bias frames, showing that offensive language often relies on implied stereotypes rather than explicit slurs\. Their work highlights the inadequacy of keyword\-based approaches for detecting nuanced forms of hate speech, motivating our investigation of DExperts’ performance on implicit toxicity\.

### 2\.5\.Positioning of Our Work

Our work builds upon these foundations by providing a systematic bridge between explicit and implicit toxicity mitigation\. While the original DExperts paper\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\)demonstrated effectiveness on RealToxicityPrompts, our study extends this evaluation to adversarial implicit hate speech using ToxiGen, revealing important generalization limitations\. Unlike previous work that focuses on either explicit toxicity benchmarks or implicit hate detection separately, we provide a unified evaluation framework that exposes the robustness gap between these regimes\. Additionally, we provide detailed computational cost analysis, which is often underexplored in the literature but critical for practical deployment decisions\.

## 3\.Methodology and Technical Approach

Our experimental methodology is structured into three sequential phases, each designed to systematically answer one of our research questions\. We adopted a collaborative team\-based approach, dividing the dataset processing among three team members to ensure computational feasibility while maintaining methodological rigor\.

### 3\.1\.Phase 1: Baseline Toxicity Measurement \(RQ1\)

#### 3\.1\.1\.Objectives

Establish ground\-truth toxicity characteristics of unmitigated GPT\-2 models to quantify the scope and severity of toxic degeneration in a standard pretrained language model\.

#### 3\.1\.2\.Experimental Setup

We utilized GPT\-2 Small \(117M parameters\) as our baseline model\. This choice was motivated by: \(1\) computational feasibility for a replication study, \(2\) its widespread use as a benchmark model in toxicity research, and \(3\) availability of compatible expert/anti\-expert models for Phase 2\.

We employed the RealToxicityPrompts dataset\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3)\), which contains 99,442 naturally occurring sentence fragments extracted from the OpenWebText Corpus\. Each prompt in this dataset represents real web text, providing ecological validity for evaluating toxic degeneration\.

Dataset Partitioning:To manage computational resources, we divided the dataset among three team members, with each member processing a specific index range of prompts from the RealToxicityPrompts dataset\. This partitioning strategy enabled parallel processing while ensuring complete coverage of our evaluation subset without overlap or gaps\.

#### 3\.1\.3\.Generation Parameters

For each prompt, we generated a single continuation using the following parameters:

- •Model:GPT\-2 Small \(117M parameters,gpt2\)
- •Sampling Method:Nucleus sampling \(top\-p\) withp=0\.9p=0\.9
- •Temperature:T=1\.0T=1\.0\(standard setting for natural text generation\)
- •Max New Tokens:20 tokens per continuation
- •Do Sample:True \(enables stochastic generation\)

These generation parameters align with standard practices in the literature\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3)\)and balance fluency with diversity\.

#### 3\.1\.4\.Toxicity Evaluation Metrics

All generated continuations were scored using the Perspective API\([perspectiveapi,](https://arxiv.org/html/2605.14087#bib.bib13)\), an industry\-standard toxicity detection system developed by Google Jigsaw\. Perspective API provides probability scores ranging from 0 \(non\-toxic\) to 1 \(highly toxic\) across multiple dimensions:

- •Toxicity:General harmful, offensive, or disrespectful content
- •Severe Toxicity:Extremely offensive content likely to cause conversational disengagement
- •Identity Attack:Negative or hateful comments targeting protected demographic groups

We focused our analysis on Toxicity as the primary dimension, as it captures the most critical safety concerns for language model deployment\.

Safety Threshold:Following established conventions in the toxicity literature\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3);[liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\), we defined the “Danger Zone” as any output with a toxicity score≥0\.5\\geq 0\.5\. Outputs exceeding this threshold are considered unsafe for production deployment\.

### 3\.2\.Phase 2: Mitigation with DExperts \(RQ2\)

#### 3\.2\.1\.DExperts Algorithm

DExperts\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\)is an inference\-time control method that steers language generation by combining predictions from three models:

- •Pbase​\(xt\|x<t\)P\_\{\\text\{base\}\}\(x\_\{t\}\|x\_\{<t\}\): The base language model \(GPT\-2\)
- •Pexpert​\(xt\|x<t\)P\_\{\\text\{expert\}\}\(x\_\{t\}\|x\_\{<t\}\): Expert model fine\-tuned on non\-toxic text
- •Panti​\(xt\|x<t\)P\_\{\\text\{anti\}\}\(x\_\{t\}\|x\_\{<t\}\): Anti\-expert model fine\-tuned on toxic text

At each decoding steptt, the modified probability distribution is computed as:

\(1\)P​\(xt\|x<t\)∝Pbase​\(xt\|x<t\)⋅\(Pexpert​\(xt\|x<t\)Panti​\(xt\|x<t\)\)αP\(x\_\{t\}\|x\_\{<t\}\)\\propto P\_\{\\text\{base\}\}\(x\_\{t\}\|x\_\{<t\}\)\\cdot\\left\(\\frac\{P\_\{\\text\{expert\}\}\(x\_\{t\}\|x\_\{<t\}\)\}\{P\_\{\\text\{anti\}\}\(x\_\{t\}\|x\_\{<t\}\)\}\\right\)^\{\\alpha\}
whereα\\alphais a hyperparameter controlling the strength of steering\. This can equivalently be expressed in log\-probability space as:

\(2\)log⁡P​\(xt\|x<t\)=log⁡Pbase​\(xt\|x<t\)\+α​\(log⁡Pexpert​\(xt\|x<t\)−log⁡Panti​\(xt\|x<t\)\)\\log P\(x\_\{t\}\|x\_\{<t\}\)=\\log P\_\{\\text\{base\}\}\(x\_\{t\}\|x\_\{<t\}\)\+\\alpha\(\\log P\_\{\\text\{expert\}\}\(x\_\{t\}\|x\_\{<t\}\)\-\\log P\_\{\\text\{anti\}\}\(x\_\{t\}\|x\_\{<t\}\)\)
The intuition is that the expert model assigns high probability to non\-toxic tokens, while the anti\-expert assigns high probability to toxic tokens\. By boosting the expert and suppressing the anti\-expert, we steer generation toward safer outputs\.

#### 3\.2\.2\.Implementation Details

We utilized pre\-trained expert and anti\-expert models provided by the original DExperts authors\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\):

- •Expert Model:finetuned\_gpt2\_nontoxic\- GPT\-2 Small fine\-tuned on non\-toxic comments from the Jigsaw Unintended Bias dataset
- •Anti\-Expert Model:finetuned\_gpt2\_toxic\- GPT\-2 Small fine\-tuned on toxic comments from the Jigsaw dataset

These models were loaded from the checkpoint paths provided in the original DExperts repository and used without modification\.

Hyperparameter Configuration:Based on preliminary experiments to address repetitive text generation, we selected the following parameters:

- •α=1\.5\\alpha=1\.5\(steering strength, reduced from initial value of 2\.0 to improve text naturalness\)
- •Repetition penalty: 1\.2 \(to reduce repetitive outputs\)
- •Temperature: 1\.0 \(maintaining diversity\)
- •Nucleus sampling:p=0\.9p=0\.9
- •Max new tokens: 20 \(consistent with Phase 1\)

For fair comparison with Phase 1, we used identical generation parameters \(nucleus sampling withp=0\.9p=0\.9, max length 20 tokens\) and evaluated on the same prompt subset from RealToxicityPrompts\.

#### 3\.2\.3\.Evaluation Metrics

In addition to toxicity scores from Perspective API \(Toxicity, Severe Toxicity, Identity Attack\), we measured:

- •Safety Rate:Percentage of generations with toxicity score<0\.5<0\.5
- •Inference Time:Wall\-clock time per generation \(seconds\), measured from prompt input to completion

### 3\.3\.Phase 3: Adversarial Robustness Testing \(RQ3\)

#### 3\.3\.1\.Objectives

Evaluate whether DExperts generalizes to implicit, adversarial hate speech that differs qualitatively from the explicit toxicity in RealToxicityPrompts\.

#### 3\.3\.2\.ToxiGen Dataset

ToxiGen\([hartvigsen2022toxigen,](https://arxiv.org/html/2605.14087#bib.bib11)\)contains 274,000 machine\-generated statements exhibiting implicit hate speech targeting 13 minority groups \(e\.g\., Black individuals, LGBTQ\+ community, Muslims, disabled people\)\. Unlike RealToxicityPrompts, which contains naturally occurring text, ToxiGen examples are specifically designed to:

- •Avoid explicit slurs and profanity
- •Express harmful stereotypes through seemingly innocuous statements
- •Use coded language and microaggressions
- •Maintain grammatical correctness and surface\-level politeness

Example ToxiGen statement: “The vast majority of \[group\] are poor because they don’t value education\.” This avoids explicit slurs while promoting harmful stereotypes\.

#### 3\.3\.3\.Experimental Setup

We selected a subset of prompts from the ToxiGen dataset, with each team member processing a specific index range\. We used the ’prompt’ field from the dataset, which provides statement prefixes designed to elicit potentially toxic completions\.

Generation Parameters:

- •Model: GPT\-2 Small with DExperts \(α=1\.5\\alpha=1\.5\)
- •Sampling: Nucleus sampling withp=0\.9p=0\.9
- •Max new tokens: 30 \(increased from 20 to allow the model more opportunity to generate implicit hate\)
- •Repetition penalty: 1\.2
- •Temperature: 1\.0

The increased token limit for Phase 3 \(30 vs\. 20\) was chosen to give the model sufficient generation space where implicit biases might manifest, as implicit hate speech often requires more subtle phrasing than explicit toxicity\.

#### 3\.3\.4\.Analysis Dimensions

We analyzed:

1. \(1\)Safety Rate Comparison:Phase 2 \(explicit\) vs\. Phase 3 \(implicit\)
2. \(2\)Toxicity Distribution Shift:Comparing distributions between phases
3. \(3\)Inference Time Analysis:Additional computational cost for adversarial prompts
4. \(4\)Failure Mode Analysis:Qualitative examination of cases where DExperts failed to prevent toxic outputs

### 3\.4\.Technical Challenges

Several technical challenges emerged during implementation:

- •API Rate Limiting:Perspective API has strict rate limits \(1 request/second for free tier\)\. We implemented caching strategies and processed evaluations over extended periods\.
- •Model Inference Optimization:Loading three models simultaneously \(base, expert, anti\-expert\) created GPU memory constraints\. We managed this through careful batch size selection and model loading strategies\.
- •Reproducibility:We documented all hyperparameters, random seeds, and dataset indices to ensure reproducibility of results\.
- •Collaborative Coordination:Division of labor among three team members required careful coordination to ensure consistent methodology and non\-overlapping index ranges\.

## 4\.Datasets and Evaluation Infrastructure

### 4\.1\.RealToxicityPrompts

Source:Gehman et al\. \(2020\)\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3)\)

Description:A dataset of 99,442 naturally occurring sentence fragments extracted from the OpenWebText Corpus\. Each prompt represents a real web text snippet, annotated with toxicity scores using Perspective API\.

Our Usage:

- •Used subset of prompts divided among three team members
- •Used for both Phase 1 \(baseline\) and Phase 2 \(mitigation\) evaluation
- •Generated single continuation per prompt

Characteristics:

- •Diverse domains \(news, social media, forums, blogs\)
- •Variable prompt toxicity \(majority non\-toxic, with toxic prompts for challenge\)
- •Representative of real\-world text distribution

### 4\.2\.ToxiGen

Source:Hartvigsen et al\. \(2022\)\([hartvigsen2022toxigen,](https://arxiv.org/html/2605.14087#bib.bib11)\)

Description:A machine\-generated dataset of 274,000 implicitly toxic statements targeting 13 minority identity groups\. Generated using GPT\-3 with careful prompting to create subtle, coded hate speech\.

Our Usage:

- •Used subset of prompts divided among team members
- •Used exclusively for Phase 3 \(adversarial robustness testing\)
- •Used the ’prompt’ field to elicit model completions

Target Groups:Black people, Asian people, Latino people, Middle Eastern people, Native American people, Pacific Islander people, Jewish people, Muslim people, LGBTQ\+ people, Women, Disabled people, Chinese people, Mexican people\.

Characteristics:

- •Implicitly toxic \(avoids explicit slurs\)
- •Grammatically correct and coherent
- •Expresses stereotypes through coded language
- •Designed to evade simple keyword filters

### 4\.3\.Expert and Anti\-Expert Models

Source:DExperts official repository\([liu2021dexperts,](https://arxiv.org/html/2605.14087#bib.bib6)\)

Description:Pre\-trained GPT\-2 Small models fine\-tuned on curated toxic and non\-toxic subsets from the Jigsaw Unintended Bias dataset\.

Models Used:

- •Expert \(finetuned\_gpt2\_nontoxic\):Trained on non\-toxic comments \(toxicity score<0\.5<0\.5\) from Jigsaw dataset
- •Anti\-Expert \(finetuned\_gpt2\_toxic\):Trained on toxic comments \(toxicity score≥0\.5\\geq 0\.5\) from Jigsaw dataset

Our Usage:

- •Downloaded pre\-trained checkpoints from DExperts repository
- •Used without modification for Phase 2 and Phase 3
- •No direct interaction with Jigsaw dataset for training

### 4\.4\.Evaluation Infrastructure: Perspective API

Tool:Google Jigsaw Perspective API\([perspectiveapi,](https://arxiv.org/html/2605.14087#bib.bib13)\)

Function:Industry\-standard toxicity scoring system using machine learning models trained on millions of human\-annotated comments\.

Technical Details:

- •REST API with JSON request/response
- •Returns probability scores \(0\-1\) for multiple attributes
- •Rate limits: 1 query/second \(free tier\)
- •Supports multiple languages \(we used English\)

Our Implementation:

- •Implemented caching layer to avoid redundant API calls
- •Batched requests with rate limit compliance
- •Stored all raw API responses for reproducibility

Limitations:

- •Perspective API itself may have biases \(e\.g\., flagging African American Vernacular English as toxic\)
- •Not perfect at detecting implicit toxicity
- •Treats toxicity as scalar rather than multidimensional

### 4\.5\.Data Processing Pipeline

Our complete data processing pipeline consisted of:

1. \(1\)Prompt Selection:Index\-based partitioning from RealToxicityPrompts and ToxiGen
2. \(2\)Text Generation:Running GPT\-2 \(baseline\) or DExperts \(mitigated\) to generate continuations
3. \(3\)Toxicity Scoring:Querying Perspective API for all generations
4. \(4\)Data Storage:Storing prompts, generations, API responses, and metadata in structured JSON format
5. \(5\)Analysis:Computing aggregate statistics, distributions, and visualizations

All code and data processing scripts are documented for reproducibility\.

## 5\.Evaluation Results and Findings

### 5\.1\.Finding 1: The Baseline Danger Zone \(RQ1\)

Our Phase 1 analysis quantified the severity of toxic degeneration in unmitigated GPT\-2 models\.

#### 5\.1\.1\.Overall Toxicity Distribution

Analysis of the baseline GPT\-2 generations revealed a characteristic distribution with a persistent toxic tail:

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase1_detail_1_toxicity.png)Figure 1\.Baseline Toxicity Distribution from Phase 1\. The distribution shows a characteristic long\-tail shape with the majority of generations clustered at low toxicity scores, but with a persistent tail extending into the ”Danger Zone” \(toxicity ¿ 0\.5\)\.The distribution exhibited a characteristic long\-tail shape: the majority of generations showed low toxicity scores, clustering near zero\. However, a persistent tail extended into dangerous territory\.

#### 5\.1\.2\.Safety Rate Analysis

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase1_plot_3_safety_percentage.png)Figure 2\.Baseline Safety Success Rate showing that 95\.8% of generations fall below the 0\.5 toxicity threshold \(safe\), while 4\.2% exceed it \(toxic\)\. This 4\.2% failure rate represents a significant safety concern for production deployment\.Detailed analysis of the safety threshold \(toxicity<0\.5<0\.5\) revealed:

- •95\.8%of generations were safe \(below 0\.5 threshold\)
- •4\.2%of generations exceeded the 0\.5 toxicity threshold \(“Danger Zone”\)

As illustrated in Figure[2](https://arxiv.org/html/2605.14087#S5.F2), this 4\.2% failure rate represents a non\-trivial risk for any production deployment\. Even though the majority of outputs are safe, the presence of toxic outputs in this proportion would be unacceptable in most user\-facing applications\.

#### 5\.1\.3\.Key Insight

The baseline evaluation confirms a structural safety problem in unmitigated GPT\-2: even from non\-toxic prompts, the model exhibits a persistent tendency toward toxic outputs with a long\-tail distribution\. This quantifies the scope of the problem and establishes the necessity for mitigation strategies\.

### 5\.2\.Finding 2: The “Happy Path” — DExperts Efficacy \(RQ2\)

Phase 2 evaluated DExperts’ ability to mitigate explicit toxicity on RealToxicityPrompts\.

#### 5\.2\.1\.Toxicity Reduction

DExperts achieved dramatic toxicity reduction compared to baseline\. The percentage of outputs in the Danger Zone \(≥0\.5\\geq 0\.5toxicity\):

- •Baseline:4\.2%
- •DExperts:0\.0%
- •Safety Rate:100%

This represents a complete elimination of the toxic tail observed in baseline GPT\-2, as visualized in Figure[3](https://arxiv.org/html/2605.14087#S5.F3)\.

#### 5\.2\.2\.Distribution Shift Analysis

The toxicity distribution underwent a fundamental transformation:

- •Baseline:Long\-tail distribution with extended right tail
- •DExperts:Sharp, concentrated distribution clustered near zero

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/plot_1_toxicity_distribution.png)Figure 3\.Shift in Toxicity Distribution with DExperts Mitigation\. The baseline distribution \(red, showing the toxic tail\) is completely transformed by DExperts \(green\), which compresses all generations into a tight, safe cluster near zero toxicity\. This represents complete elimination of the 4\.2% failure rate observed in Phase 1\.![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/plot_6_toxicity_cdf.png)Figure 4\.CDF comparison between baseline and DExperts, showing the dramatic compression of the toxicity distribution\. The DExperts curve reaches 100% safety well before the 0\.5 threshold, while the baseline curve shows gradual increase past the danger zone\.Visual inspection of density plots \(Figure[3](https://arxiv.org/html/2605.14087#S5.F3)and Figure[4](https://arxiv.org/html/2605.14087#S5.F4)\) confirmed that DExperts successfully ”compressed” the distribution, eliminating high\-toxicity outliers while maintaining low baseline toxicity\.

#### 5\.2\.3\.Computational Cost

Inference Latency:

- •Baseline GPT\-2:Mean≈\\approx0\.2s
- •DExperts:Mean≈\\approx2\.0s
- •Slowdown Factor:∼\\sim10x

The 10x latency increase stems from:

1. \(1\)Running three models instead of one \(base, expert, anti\-expert\)
2. \(2\)Computing and combining logits at each decoding step
3. \(3\)Additional memory access overhead

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/plot_3_safety_percentage.png)\(a\)Safety success rate: 95\.8% baseline vs\. 100% DExperts
![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/plot_4_latency_tradeoff.png)\(b\)Computational cost: 10x latency increase

Figure 5\.Trade\-offs in DExperts mitigation: \(a\) Perfect safety achievement on RealToxicityPrompts with 100% safe generations, eliminating all baseline failures\. \(b\) Computational overhead showing mean latency increase from 0\.2s to 2\.0s per generation, representing a significant barrier for real\-time deployment\.For a typical 20\-token generation, DExperts requires approximately 2 seconds on standard GPU hardware, compared to 0\.2 seconds for baseline \(see Figure[5](https://arxiv.org/html/2605.14087#S5.F5)\)\. This represents a significant barrier for real\-time applications \(e\.g\., chatbots, autocomplete\) where sub\-second response times are expected\.

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/plot_5_worst_case_analysis.png)Figure 6\.Worst\-case analysis comparing the top 10% most toxic outputs\. Even in worst\-case scenarios, DExperts maintains toxicity well below the 0\.5 threshold \(green boxplot\), while baseline worst\-case outputs \(red\) extend deep into the danger zone\.Even in worst\-case analysis scenarios, DExperts maintains safety well below the danger threshold, as demonstrated in Figure[6](https://arxiv.org/html/2605.14087#S5.F6)\.

#### 5\.2\.4\.Key Insight

On standard explicit toxicity benchmarks \(RealToxicityPrompts\), DExperts is highly effective, achieving perfect safety while maintaining acceptable quality\. However, this comes at a substantial computational cost that may be prohibitive for latency\-sensitive applications\.

### 5\.3\.Finding 3: The Robustness Gap \(RQ3\)

Phase 3 stress\-tested DExperts against adversarial implicit hate speech using ToxiGen\.

#### 5\.3\.1\.Safety Rate Comparison

- •Phase 2 \(RealToxicityPrompts\):100\.0% safe
- •Phase 3 \(ToxiGen\):98\.5% safe \(1\.5% toxic\)

While 98\.5% remains a high safety rate in absolute terms, the1\.5% failure raterepresents a significant degradation from the perfect performance on explicit toxicity \(Figure[8](https://arxiv.org/html/2605.14087#S5.F8)\)\. This indicates ”leakage” of implicit hate speech through the mitigation mechanism\.

#### 5\.3\.2\.Toxicity Distribution Analysis

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase3_plot_1_robustness_gap.png)Figure 7\.The Robustness Gap: Violin plot comparison between Phase 2 \(RealToxicityPrompts, teal\) and Phase 3 \(ToxiGen, purple\)\. While Phase 2 shows a flat, compact distribution entirely below the 0\.5 threshold, Phase 3 exhibits a concerning bulge that crosses into the danger zone, demonstrating DExperts’ brittleness against implicit hate speech\.![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase3_plot_3_safety_success.png)Figure 8\.Safety consistency comparison across phases\. Phase 2 achieves 100% safety on explicit toxicity, while Phase 3 degrades to 98\.5%, revealing a 1\.5% leakage rate for implicit hate speech\.Figure[7](https://arxiv.org/html/2605.14087#S5.F7)clearly visualizes the robustness gap between explicit and implicit toxicity mitigation\. Violin plots comparing Phase 2 and Phase 3 distributions revealed:

Phase 2 \(Explicit — RealToxicityPrompts\):

- •Tight, compact distribution centered near zero
- •Negligible variance
- •No values exceeding 0\.5 threshold

Phase 3 \(Implicit — ToxiGen\):

- •Broader, more dispersed distribution
- •Visible ”bulge” extending past 0\.5 threshold
- •Some outputs reaching moderate toxicity levels

The distribution difference quantifies therobustness gap: DExperts’ mitigation strength degrades when confronted with implicit, coded hate speech that doesn’t rely on explicit toxic keywords\.

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase3_plot_4_worst_case.png)Figure 9\.Phase 3 worst\-case analysis showing the top 10% most toxic outputs from adversarial prompts\. Even the worst cases remain relatively controlled, but the presence of any failures contrasts sharply with Phase 2’s perfect record\.![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase3_plot_6_cdf.png)Figure 10\.CDF curves showing the robustness gap\. Phase 2 \(teal\) maintains perfect safety, while Phase 3 \(purple\) shows a small but significant proportion of outputs exceeding safety thresholds\.
#### 5\.3\.3\.Failure Mode Analysis

Qualitative examination of failed cases revealed patterns consistent with broader research on implicit hate speech detection failures\([roy2023probing,](https://arxiv.org/html/2605.14087#bib.bib17);[zeng2025metaphorical,](https://arxiv.org/html/2605.14087#bib.bib18)\):

Stereotyping through ”Factual” Claims:

- •Statements framed as statistics or general observations
- •DExperts fails because content avoids explicit slurs while perpetuating harmful stereotypes

Coded Language:

- •Seemingly descriptive language to convey dehumanizing content
- •Uses neutral\-appearing vocabulary to express bias

Microaggressions:

- •Appears complimentary but implies negative stereotypes
- •Subtle put\-downs disguised as polite observations

These patterns suggest that DExperts’ anti\-expert, trained primarily on explicitly toxic Jigsaw comments, lacks sufficient exposure to implicit hate patterns, resulting in incomplete coverage\.

### 5\.4\.Finding 4: The Double Penalty \- Latency vs\. Safety \(RQ3\)

Analysis of the relationship between computational cost and mitigation efficacy on adversarial prompts revealed a concerning pattern\.

#### 5\.4\.1\.Latency Distribution Shift

![Refer to caption](https://arxiv.org/html/2605.14087v1/plots/phase3_plot_5_latency_dist.png)Figure 11\.Computational overhead comparison showing histogram of inference latency across phases\. Phase 3 adversarial prompts \(purple\) show increased latency with a long tail, compared to Phase 2’s more compact distribution \(green\)\.Phase 2 \(Explicit\):

- •Mean latency: 2\.0s
- •Distribution: Compact, normally distributed

Phase 3 \(Adversarial\):

- •Mean latency: 3\.2s
- •Distribution: Right\-skewed with long tail
- •Some generations extending to 5\+ seconds

Theadversarial overhead\(additional latency for ToxiGen prompts\) averaged approximately 1\.2 seconds, representing a 60% increase over standard DExperts latency\. This indicates that adversarial prompts not only defeat the safety mechanism but also impose additional computational burden\.

Interpretation:When DExperts encounters difficult adversarial prompts, it:

1. \(1\)Expends more computation time \(higher latency\) attempting to find safe continuations
2. \(2\)Still fails to suppress toxicity effectively

This represents a ”double penalty”: the model works harder \(increased latency\) yet produces worse outcomes \(higher toxicity\)\. This suggests fundamental limitations in the steering mechanism’s ability to recognize and mitigate implicit hate patterns\.

### 5\.5\.Summary of Key Findings

1. \(1\)Baseline Risk:Unmitigated GPT\-2 exhibits a 4\.2% toxic generation rate from non\-toxic prompts, establishing a clear need for mitigation\.
2. \(2\)Explicit Mitigation Success:DExperts achieves 100% safety on RealToxicityPrompts, completely eliminating the toxic tail, validating its effectiveness for explicit toxicity\.
3. \(3\)Robustness Gap:Safety degrades to 98\.5% on adversarial implicit hate speech \(ToxiGen\), revealing brittleness and generalization failures\.
4. \(4\)Computational Trade\-off:DExperts introduces 10x latency overhead, with additional adversarial overhead of 60%, posing practical deployment challenges\.
5. \(5\)Double Penalty Phenomenon:Difficult adversarial cases incur both higher latency and higher toxicity, indicating fundamental limitations in the steering mechanism\.

## 6\.Discussion

### 6\.1\.Implications for AI Safety

Our findings have important implications for the deployment of LLMs in real\-world applications:

Laboratory vs\. Real\-World Gap:The robustness gap between explicit and implicit toxicity mitigation highlights a dangerous disconnect between safety benchmarks and actual deployment scenarios\. Luong et al\.\([luong2024realistic,](https://arxiv.org/html/2605.14087#bib.bib20)\)introduced the Thoroughly Engineered Toxicity \(TET\) dataset, comprising 2,546 prompts filtered from over 1 million real\-world interactions with 25 different LLMs\. Their work demonstrates that models evaluated on synthetic benchmarks like RealToxicityPrompts exhibit significantly different toxicity patterns when confronted with realistic adversarial prompts\. Specifically, TET consistently elicits more toxicity than ToxiGen when prompt toxicity levels are similar, and models show substantially reduced effectiveness against manually crafted jailbreak templates\. This emphasizes that the 98\.5% safety rate we observe on ToxiGen, while concerning, may still overestimate real\-world robustness\.

Chao et al\.\([chao2024jailbreakbench,](https://arxiv.org/html/2605.14087#bib.bib26)\)further underscore this challenge through JailbreakBench, an open\-sourced benchmark with evolving adversarial prompts, standardized evaluation frameworks, and a public leaderboard tracking attack and defense performance\. Their work reveals that even well\-defended models can be systematically compromised through carefully engineered prompts, suggesting that static evaluation on fixed datasets like ToxiGen provides an incomplete picture of model safety\.

Arms Race Dynamics:The brittleness of DExperts against implicit hate suggests an adversarial arms race: as mitigation techniques improve at detecting explicit toxicity, bad actors may shift to more subtle forms of hate speech\. Future mitigation approaches must anticipate this adaptation\.

Computational Feasibility:The 10x latency penalty raises questions about the practical feasibility of inference\-time control for interactive applications\. While acceptable for offline content generation \(e\.g\., marketing copy, code comments\), the 2\-second response time may be unacceptable for real\-time chat or autocomplete scenarios where users expect instant feedback\.

### 6\.2\.Limitations

Several limitations constrain the generalizability of our findings:

Model Scale:Our experiments used GPT\-2 Small \(117M parameters\) due to computational constraints\. Larger models \(GPT\-2 Medium/Large, GPT\-3\) may exhibit different toxicity characteristics and respond differently to DExperts\. However, prior work\([gehman2020realtoxicityprompts,](https://arxiv.org/html/2605.14087#bib.bib3)\)suggests larger models are not necessarily safer\.

Sample Size:While our evaluation provides statistically robust estimates, we evaluated only a subset of available prompts\. A full evaluation on complete datasets would strengthen confidence\.

Perspective API Limitations:Our reliance on the Perspective API as the ground\-truth toxicity metric introduces potential biases\. Perspective API is known to exhibit false positives on African American Vernacular English and may struggle with implicit hate detection\. Ideally, evaluation would incorporate multiple toxicity classifiers and human annotation\.

Language and Cultural Context:Our study focused exclusively on English language toxicity\. Hate speech patterns, coded language, and cultural context vary substantially across languages, limiting generalizability\. Jain et al\.\([jain2024polyglotoxicity,](https://arxiv.org/html/2605.14087#bib.bib22)\)introduced PolygloToxicityPrompts \(PTP\), covering 9 languages across 5 different scripts with models ranging from 1\.3B to 13B parameters\. Their multilingual evaluation reveals that toxicity patterns and mitigation effectiveness vary significantly across languages, with translated data sometimes outperforming in\-language training data \(38% vs 33% toxicity reduction for high\-resource languages\)\. This suggests that our English\-only findings may not transfer directly to other linguistic contexts, and that multilingual evaluation is essential for global deployment of toxicity mitigation systems\.

Single Mitigation Method:We evaluated only DExperts\. Other inference\-time methods \(PPLM, FUDGE\) or alternative approaches \(RLHF, constitutional AI\) may exhibit different robustness profiles\.

### 6\.3\.Future Work

Our findings point to several promising research directions:

Hybrid Mitigation:Combining multiple approaches \(e\.g\., DExperts for explicit toxicity \+ fine\-grained classifiers for implicit hate\) may achieve better coverage across toxicity types\.

Lightweight Expert Models:Developing smaller, distilled expert/anti\-expert models could reduce computational overhead while maintaining safety\.

Adversarial Training:Training anti\-experts specifically on implicit hate datasets like ToxiGen may close the robustness gap\.

Context\-Aware Mitigation:Incorporating broader conversational context \(beyond single prompts\) may improve detection of subtle toxicity patterns\.

Human\-in\-the\-Loop:For high\-stakes applications, combining automated mitigation with human review workflows may provide optimal safety\-utility balance\.

Cross\-Lingual Evaluation:Extending this evaluation framework to non\-English languages would assess generalizability and reveal language\-specific challenges\.

## 7\.Conclusion

This comprehensive replication study evaluated the DExperts inference\-time control method for toxicity mitigation in large language models across a spectrum from explicit to implicit hate speech\. Our systematic three\-phase evaluation quantified both the strengths and fundamental limitations of current mitigation approaches\.

We demonstrated that DExperts achieves exceptional performance on explicit toxicity benchmarks, eliminating the 4\.2% baseline failure rate\. However, it exhibits brittleness when confronted with adversarial implicit hate speech, with safety rates degrading to 98\.5%\. Furthermore, the method imposes a significant computational burden, introducing 10x latency overhead that escalates further for difficult adversarial inputs\.

These findings underscore that perfect performance on standard benchmarks does not guarantee robustness in real\-world deployment scenarios\. As language models continue scaling and deploying in user\-facing applications, the field must develop mitigation strategies that generalize across diverse toxicity patterns from overt slurs to coded stereotypes, while remaining computationally practical for interactive use cases\.

The robustness gap we identified represents both a challenge and an opportunity\. By exposing the limitations of current methods, we hope to motivate the development of next\-generation approaches that achieve comprehensive safety without prohibitive costs, ultimately enabling the responsible deployment of powerful language technologies\.

## References

- \(1\)Tom B\. Brown et al\.Language models are few\-shot learners\.InAdvances in Neural Information Processing Systems \(NeurIPS\), 2020\.
- \(2\)Alec Radford et al\.Language models are unsupervised multitask learners\.OpenAI Blog, 2019\.
- \(3\)Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A\. Smith\.RealToxicityPrompts: Evaluating neural toxic degeneration in language models\.InFindings of EMNLP, pages 3356–3369, 2020\.
- \(4\)Emily M\. Bender, Timnit Gebru, Angelina McMillan\-Major, and Shmargaret Shmitchell\.On the dangers of stochastic parrots: Can language models be too big?InProceedings of FAccT, pages 610–623, 2021\.
- \(5\)Emily Sheng, Kai\-Wei Chang, Premkumar Natarajan, and Nanyun Peng\.The woman worked as a babysitter: On biases in language generation\.InProceedings of EMNLP, pages 3407–3412, 2019\.
- \(6\)Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A\. Smith, and Yejin Choi\.DExperts: Decoding\-time controlled text generation with experts and anti\-experts\.InProceedings of ACL\-IJCNLP, pages 6691–6706, 2021\.
- \(7\)Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po\-Sen Huang\.Challenges in detoxifying language models\.InFindings of EMNLP, pages 2447–2469, 2021\.
- \(8\)Long Ouyang et al\.Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems \(NeurIPS\), 2022\.
- \(9\)Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu\.Plug and play language models: A simple approach to controlled text generation\.InProceedings of ICLR, 2020\.
- \(10\)Kevin Yang and Dan Klein\.FUDGE: Controlled text generation with future discriminators\.InProceedings of NAACL, pages 3511–3535, 2021\.
- \(11\)Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar\.ToxiGen: A large\-scale machine\-generated dataset for adversarial and implicit hate speech detection\.InProceedings of ACL, pages 3309–3326, 2022\.
- \(12\)Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A\. Smith\.The risk of racial bias in hate speech detection\.InProceedings of ACL, pages 1668–1678, 2019\.
- \(13\)Perspective API\.Perspective API documentation\.[https://www\.perspectiveapi\.com/](https://www.perspectiveapi.com/), 2023\.
- \(14\)Xavier Suau, Pieter Delobelle, Katherine Metcalf, Armand Joulin, Nicholas Apostoloff, Luca Zappella, and Pau RodríguezWhispering Experts: Neural Interventions for Toxicity Mitigation in Language Models\.InProceedings of ACL, 2024\.
- \(15\)Heegyu Kim and Hyunsouk ChoGTA: Gated Toxicity Avoidance for LM Performance Preservation\.InFindings of EMNLP, 2023\.
- \(16\)Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie HuangUnveiling the Implicit Toxicity in Large Language Models\.InProceedings of EMNLP, 2023\.
- \(17\)Sarthak Roy, Ashish Harshavardhan, Animesh Mukherjee, and Punyajoy SahaProbing LLMs for hate speech detection: strengths and vulnerabilities\.InFindings of EMNLP, 2023\.
- \(18\)Jingjie Zeng, Liang Yang, Zekun Wang, Yuanyuan Sun, and Hongfei LinSheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech?InProceedings of ACL, 2025\.
- \(19\)Isabel O\. Gallegos, Ryan A\. Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K\. AhmedBias and Fairness in Large Language Models: A Survey\.Computational Linguistics, volume 50, pages 1097–1179, 2024\.
- \(20\)Tinh Son Luong, Thanh\-Thien Le, Linh Ngo Van, and Thien Huu NguyenRealistic Evaluation of Toxicity in Large Language Models\.InFindings of ACL, 2024\.
- \(21\)Mengru Wang et al\.Detoxifying Large Language Models via Knowledge Editing\.arXiv preprint arXiv:2403\.14472, 2024\.
- \(22\)Devansh Jain, Priyanshu Kumar, Samuel Gehman, Xuhui Zhou, Thomas Hartvigsen, Maarten SapPolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models\.arXiv preprint arXiv:2405\.09373, 2024\.
- \(23\)Xun Liang, Hanyu Wang, Yezhaohui Wang, Shichao Song, Jiawei Yang, Simin Niu, Jie Hu, Dan Liu, Shunyu Yao, Feiyu Xiong, and Zhiyu LiControllable Text Generation for Large Language Models: A Survey\.arXiv preprint arXiv:2408\.12599, 2024\.
- \(24\)Yuntao Bai et al\.Constitutional AI: Harmlessness from AI Feedback\.arXiv preprint arXiv:2212\.08073, 2022\.
- \(25\)Luiza Pozzobon, Patrick Lewis, Sara Hooker, Beyza ErmisFrom One to Many: Expanding the Scope of Toxicity Mitigation in Language Models\.InFindings of ACL, 2024\.
- \(26\)Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J\. Pappas, and Eric Wong\.JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models\.arXiv preprint arXiv:2404\.01318, 2024\.
- \(27\)Bohdan Turbal, Anastasiia Mazur, Jiaxu Zhao, and Mykola PechenizkiyOn Adversarial Robustness of Language Models in Transfer Learning\.arXiv preprint arXiv:2501\.00066, 2024\.

Similar Articles

Detoxification for LLM: From Dataset Itself

arXiv cs.CL

Researchers propose HSPD, a corpus-level detoxification pipeline that rewrites toxic spans in pretraining data while preserving semantics, achieving state-of-the-art toxicity reduction on GPT-2 XL, LLaMA-2, OPT, and Falcon models.

Lessons learned on language model safety and misuse

OpenAI Blog

OpenAI shares lessons learned on language model safety and misuse, discussing challenges in measuring risks, the limitations of existing benchmarks, and their development of new evaluation metrics for toxicity and policy violations. The post also highlights concerns about labor market impacts and the need for continued research on measuring social effects of AI deployment at scale.

DART: Mitigating Harm Drift in Difference-Aware LLMs via Distill-Audit-Repair Training

arXiv cs.CL

DART (Distill-Audit-Repair Training) is a new training framework that addresses 'harm drift' in safety-aligned LLMs, where fine-tuning for demographic difference-awareness causes harmful content to appear in model explanations. On eight benchmarks, DART improves Llama-3-8B-Instruct accuracy from 39.0% to 68.8% while reducing harm drift cases by 72.6%.