A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models
Summary
This survey synthesizes research on toxicity detection and detoxification for multilingual large language models, cataloging threat models, task formulations, detection approaches, and mitigation strategies, while identifying persistent challenges such as uneven language coverage and culturally contingent definitions of harm.
# A Survey of Toxicity Detection and Mitigation Strategies for Multilingual Language Models

Source: [https://arxiv.org/html/2606.25380](https://arxiv.org/html/2606.25380)

Soham Dan¹, Himanshu Beniwal², Thomas Hartvigsen³

¹Scale AI, ²Indian Institute of Technology Gandhinagar, ³University of Virginia

soham\.dan@scale\.com, himanshubeniwal@iitgn\.ac\.in, hartvigsen@virginia\.edu

###### Abstract

Large language models \(LLMs\) are increasingly deployed across languages, but their safety behavior remains uneven across linguistic and cultural contexts\. This survey synthesizes work on toxicity detection and detoxification for multilingual LLMs\. We first catalogue threat models that exploit language choice, translation pivots, code\-switching, orthographic variation, multi\-turn interaction, and post\-deployment fine\-tuning to weaken safety alignment\. We then organize task formulations \(toxic\-to\-neutral rewriting, toxicity classification, and toxic\-generation evaluation\), multilingual detection approaches \(cross\-lingual encoders, translation pipelines, representation\-level probes, and LLM\-based detectors\), and mitigation strategies spanning data filtering, supervised and preference\-based tuning, decoding\-time steering, representation editing, and multilingual guardrails\. Across these areas, we identify persistent challenges: uneven language coverage, culturally contingent definitions of harm, fragmented evaluation protocols, and the risk that detoxification suppresses legitimate dialectal or identity\-related expression\.
Figure 1: Taxonomy of multilingual toxicity threat models, task formulations, evaluation metrics, detection approaches, and mitigation strategies\.

## 1 Introduction

Large language models \(LLMs\) are increasingly used in multilingual settings, powering applications ranging from multilingual chatbots to cross\-lingual content moderation \(de Wynter et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib97); Hartvigsen et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib96); Kim et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib48)\)\. As deployment expands, so do safety risks: LLMs can produce, amplify, or fail to detect toxic content such as hate speech, harassment, profanity, and identity\-based abuse, and these risks are not distributed uniformly across languages \(Röttger et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib88); Sharma and Bhalla, [2025](https://arxiv.org/html/2606.25380#bib.bib126); Deshpande et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib65); Krasnodębska et al\., [2026](https://arxiv.org/html/2606.25380#bib.bib133)\)\. Despite substantial progress on English detoxification, multilingual detection and mitigation remain less mature, especially for low\-resource languages, dialects, code\-switched inputs, and culturally specific harms \(Beniwal et al\., [2025a](https://arxiv.org/html/2606.25380#bib.bib47); Tiţa and Zubiaga, [2021](https://arxiv.org/html/2606.25380#bib.bib111); Dementieva et al\., [2024a](https://arxiv.org/html/2606.25380#bib.bib25); Logacheva et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib54); de Wynter et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib97)\)\.

#### The Complexity of Multilingual Toxicity\.
Multilingual detoxification is not a direct translation of English safety protocols \(Neplenbroek et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib84); Kumar et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib12)\)\. Toxicity ranges from *overt* categories, such as slurs, explicit insults, and profanity, to *implicit* forms such as microaggressions, sarcasm, and toxic condescension, which are harder to annotate, detect, and mitigate \(Wen et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib45); Sap et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib125)\)\. Definitions of harm also vary by community: expressions that are benign, reclaimed, or dialectal in one context may be offensive in another\. Multilingual settings introduce additional technical vulnerabilities\. Code\-switching, transliteration, and mixed\-script inputs can weaken both detectors and refusal behavior \(Zhang et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib128); Al Ghanim et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib20); Yoo et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib21)\), while pretrained models can degenerate into toxic continuations from benign or ambiguous prompts \(Gehman et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib58)\)\. These failures interact with broader cultural and social biases in generated language \(Vongpradit et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib44); Dammu et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib43)\)\.

#### Failures of Current Approaches\.

Traditional moderation systems rely heavily on keyword lists, rules, and supervised classifiers, which are brittle under paraphrase, obfuscation, dialectal variation, and context\-dependent meaning \(Kim et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib48); Huang, [2025](https://arxiv.org/html/2606.25380#bib.bib41)\)\.
LLM alignment reduces many overt harms, but it does not transfer uniformly across languages: malicious prompts in lower\-resource languages are more likely to elicit unsafe responses \(Deng et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib2); Shen et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib8)\), and preference optimization or RLHF data remain concentrated in a small set of high\-resource languages \(Dang et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib119); Lu et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib51)\)\. Preference tuning can transfer across languages, but transfer quality varies with representation alignment and language\-resource availability \(Li et al\., [2024b](https://arxiv.org/html/2606.25380#bib.bib68); Neplenbroek et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib84)\)\. Machine translation is not a universal fallback either: multilingual translation systems can introduce, amplify, or obscure toxicity through hallucination and data bias \(Costa\-Jussà et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib39)\)\. These failures make multilingual detoxification a problem of technical robustness, evaluation validity, and sociolinguistic coverage \(Adragna et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib38); Cecchini et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib37)\)\.

This survey provides a focused overview of detoxification for multilingual LLMs, synthesizing recent work on detection and mitigation into a taxonomy of datasets, methods, and evaluation frameworks \(Figure [1](https://arxiv.org/html/2606.25380#S0.F1)\)\. Related surveys examine multilingual LLM safety broadly \(Krasnodębska et al\., [2026](https://arxiv.org/html/2606.25380#bib.bib133)\); our emphasis is the narrower detoxification pipeline: how toxic behavior is induced, measured, detected, and mitigated across languages\.

#### Scope and Contributions\.
The survey is organized around the following themes:

- We organize multilingual threat models covering language\-shift jailbreaks, translation/pivot attacks, code\-switch prompts, multilingual red\-teaming, and adaptation\-time safety collapse from cross\-lingual fine\-tuning \(§[2](https://arxiv.org/html/2606.25380#S2)\)\.
- We organize task formulations into three categories \(toxic\-to\-neutral rewriting, toxicity classification, and toxic\-generation/prompt continuation\) and survey the datasets and metrics used to evaluate each\.
- We survey multilingual toxicity detection methods, spanning encoder\- and decoder\-based transformers, translation\-based pipelines, representation\-level probing, and LLM\-based zero\-shot detection\.
- We present a mechanism\-based detoxification taxonomy covering data\-centric filtering, supervised and preference\-based tuning, decoding\-time steering, representation editing, and multilingual guardrails\.

We conclude with a discussion of open challenges \(cross\-lingual coverage gaps, cultural misalignment, evaluation fragmentation, and over\-suppression\) and identify concrete directions for building globally safe and equitable LLMs\.

## 2 Threat Models for Inducing Toxicity in Multilingual LLMs

We focus on *multilingual\-specific* toxicity\-inducing threat models, which are adversarial procedures that exploit language choice, cross\-lingual transfer, or multilingual interaction to elicit toxic outputs from a safety\-aligned model\. In this survey, we treat safety vulnerabilities \(jailbreaks, alignment bypass, red\-teaming\) as the mechanisms through which toxic outputs are induced; “safety failure” and “toxicity elicitation” are thus two views of the same problem\. To make these threat models comparable, we use four diagnostic axes: \(i\) language composition \(monolingual vs\. code\-switched\), \(ii\) script composition \(standard vs\. mixed\-script/transliterated\), \(iii\) translation mediation \(direct vs\.
pivot/round\-trip\), and \(iv\) cultural\-norm variation \(universal vs\. culturally contingent harm\)\. The subsections below instantiate these axes: language\-shift attacks isolate non\-English prompting; translation\-mediated attacks stress pivoting and round\-trip evaluation; code\-switching attacks combine language and script composition; and multilingual red\-teaming/adaptation attacks expose how these linguistic operators interact with culturally contingent safety policies\.

### 2\.1 Prompt\-Space Multilingual Attacks

#### Language\-Shift Jailbreaks\.

This threat primarily tests *language composition*: a malicious or ambiguous request remains monolingual but is re\-expressed outside English\. Deng et al\. \([2024](https://arxiv.org/html/2606.25380#bib.bib2)\) formalize \(i\) *unintentional* multilingual jailbreaks \(benign users prompting in underrepresented languages\) and \(ii\) *intentional* multilingual jailbreaks \(adversaries combining multilingual prompts with explicit malicious instructions\), and show substantially higher unsafe rates in lower\-resource languages\.

#### Translation\-Mediated and Pivot Attacks\.

This threat stresses *translation mediation*: an unsafe English prompt is translated into a target low\-resource language to increase compliance, then the response is translated back\. Shen et al\. \([2024](https://arxiv.org/html/2606.25380#bib.bib8)\) empirically demonstrate higher unsafe response rates for malicious prompts expressed in lower\-resource languages, motivating translation/pivot\-based red\-teaming\. Recent defenses that *re\-anchor* safety using English while enforcing target\-language outputs further underscore translation as a core failure mode in multilingual safety \(Zhang et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib22)\)\.

#### Language Mixing: Code\-Switching and Multi\-Language Mixtures\.
This threat combines *language composition* with *script composition*, because multilingual prompts may mix languages, scripts, and transliterated forms within one context\. Yoo et al\. \([2025](https://arxiv.org/html/2606.25380#bib.bib21)\) show that code\-switched red\-teaming queries can elicit unsafe behavior more effectively than monolingual attacks and introduce a synthesis framework \(CSRT\) to generate such queries at scale\. Complementarily, Upadhayay and Behzadan \([2024](https://arxiv.org/html/2606.25380#bib.bib3)\) propose the *Sandwich Attack*, a multi\-language mixture prompt that interleaves benign and adversarial segments across languages to induce harmful completions in a black\-box setting\.

### 2\.2 Multilingual Red Teaming

Red teaming operationalizes these axes by generating adversarial prompts and dialogues at scale, including culturally specific prompts whose harmfulness may not be captured by English\-centric policies\. Early work established manual and LM\-assisted red teaming methodologies \(Perez et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib5); Zhuo et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib36)\)\. Recent multilingual extensions explicitly target the multilingual capability envelope: CSRT generates code\-switched attacks \(Yoo et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib21)\); Rainbow Teaming produces diverse open\-ended adversarial prompts and has been replicated and extended for Polish as a concrete non\-English safety stress test \(Samvelyan et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib6); Krasnodębska et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib7)\); and MM\-ART automates *multi\-turn, multilingual* red teaming, showing that vulnerability increases sharply with conversation length and is substantially underestimated by single\-turn English evaluation \(Singhania et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib4)\)\.
### 2\.3 Post\-Deployment Adaptation Attacks

#### Cross\-lingual Fine\-Tuning Attacks\.

Aligned multilingual models are frequently customized via SFT/PEFT after deployment, creating an adaptation\-time attack surface where safety behavior can shift across languages and local norms\. Poppi et al\. \([2025](https://arxiv.org/html/2606.25380#bib.bib23)\) show that fine\-tuning on a small toxic dataset in *one* language can collapse safety across *other* languages \(cross\-lingual attack transfer\)\. Their Safety Information Localization \(SIL\) analysis suggests safety\-relevant parameters are partially language\-agnostic, enabling sparse updates to induce multilingual failure\.

#### Jailbreaks via New\-Language Learning\.

Even benign adaptation can be risky: Upadhayay and Behzadan \([2025](https://arxiv.org/html/2606.25380#bib.bib19)\) show that LoRA fine\-tuning to learn a low\-resource language, without any harmful data, can nonetheless degrade refusal behavior, implying that multilingual expansion itself can destabilize safety guarantees\.

Multilingual detoxification methods should therefore be evaluated not only on monolingual English prompts, but under compositions of multilingual operators \(translate/pivot, code\-switch, mixture prompts, transliteration\), multi\-turn interaction, and post\-deployment adaptation stress tests\. The threat models above motivate the task formulations, datasets, and metrics we discuss next\.

## 3 Task Setup: Datasets and Metrics

### 3\.1 Datasets

Toxicity datasets can broadly be categorized into three tasks, each corresponding to a distinct evaluation goal:

Toxic\-to\-Neutral Rewriting\. ParaDetox \(Logacheva et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib54)\) introduced more than 10K English toxic→neutral paraphrase pairs\.
Subsequent work explored cross\-lingual transfer for detoxification \(Moskovskiy et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib135); Dementieva et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib24)\), added a Hindi evaluation set \(Mukherjee et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib93)\), and extended the ParaDetox collection pipeline to Russian, Ukrainian, and Spanish in MultiParaDetox \(Dementieva et al\., [2024a](https://arxiv.org/html/2606.25380#bib.bib25)\)\. The TextDetox/PAN 2024 shared task and its COLING extension broadened the parallel detoxification setting to 9 languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic \(Dementieva et al\., [2024b](https://arxiv.org/html/2606.25380#bib.bib94), [2025](https://arxiv.org/html/2606.25380#bib.bib1)\)\. SynthDetox\-M \(Moskovskiy et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib82), [2024](https://arxiv.org/html/2606.25380#bib.bib113)\) adds 16K high\-quality synthetic pairs across German, French, Spanish, and Russian via few\-shot LLM prompting\. APPDIA \(Atwell et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib79)\) and CAPP \(Som et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib95)\) provide discourse\- or dialogue\-aware parallel corpora for offensive\-content paraphrasing\.

Toxic Text Detection\. Jigsaw’s English Toxic Comment and Unintended Bias tasks provide large\-scale comment\-level toxicity labels, while the multilingual Jigsaw task evaluates binary toxicity in Spanish, Italian, Turkish, French, Portuguese, and Russian using English\-labeled training data \(Jigsaw, [2018](https://arxiv.org/html/2606.25380#bib.bib134); Kivlichan et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib85)\)\.
OffensEval covers English offensive\-language identification in 2019 and five languages in 2020 \(Arabic, Danish, English, Greek, and Turkish\) \(Zampieri et al\., [2019](https://arxiv.org/html/2606.25380#bib.bib86), [2020](https://arxiv.org/html/2606.25380#bib.bib115)\); the related Toxic Spans task targets span\-level explanations in English \(Pavlopoulos et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib87)\)\. HateCheck provides functional tests for English hate\-speech detection, and Multilingual HateCheck extends this diagnostic framing to ten languages \(Röttger et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib88), [2022](https://arxiv.org/html/2606.25380#bib.bib89)\)\. HASOC and HatEval provide additional multilingual hate/offensive\-language benchmarks \(Mandl et al\., [2019](https://arxiv.org/html/2606.25380#bib.bib91); Basile et al\., [2019](https://arxiv.org/html/2606.25380#bib.bib90)\)\. LifeTox \(Kim et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib99)\) targets implicit toxicity in English advice\-seeking contexts, and ToxiGen \(Hartvigsen et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib96)\) provides 274K machine\-generated toxic and benign statements about protected groups\. Such classification datasets serve both toxicity evaluation \(Koh et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib64)\) and retrieval\-based detoxification \(Pozzobon et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib74)\)\.

Non\-Toxic Text Continuation\. RealToxicityPrompts \(RTP\) \(Gehman et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib58)\) provides 100K English web prompts scored by Perspective API and introduced common toxic\-generation metrics such as Expected Maximum Toxicity \(EMT\) and toxicity probability\.
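The two RTP metrics can be illustrated with a minimal sketch. It assumes the usual reading of these metrics: EMT averages, over prompts, the highest toxicity score among each prompt's sampled continuations, and toxicity probability is the fraction of prompts with at least one continuation at or above a threshold. The function names, the 0.5 threshold default, and the score values are illustrative assumptions, not taken from any benchmark release.

```python
from statistics import mean


def expected_max_toxicity(scores_per_prompt):
    """Average, over prompts, of the maximum toxicity score among that prompt's continuations."""
    return mean(max(scores) for scores in scores_per_prompt)


def toxicity_probability(scores_per_prompt, threshold=0.5):
    """Fraction of prompts with at least one continuation scoring at or above the threshold."""
    return mean(1.0 if max(scores) >= threshold else 0.0 for scores in scores_per_prompt)


# Hypothetical toxicity scores in [0, 1]: 3 prompts, 4 sampled continuations each.
scores = [
    [0.10, 0.20, 0.05, 0.30],
    [0.70, 0.40, 0.90, 0.20],
    [0.45, 0.10, 0.55, 0.25],
]

emt = expected_max_toxicity(scores)
tox_prob = toxicity_probability(scores)
print(f"EMT={emt:.3f}  toxicity probability={tox_prob:.3f}")
```

Both numbers depend on how many continuations are sampled per prompt and on the scorer used, so values are only comparable across systems evaluated under the same sampling budget and classifier.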
RTP\-LX \(de Wynter et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib97)\) extends this style of evaluation to the 28 languages reported in the paper, with human\-transcreated prompts and native\-speaker annotations for harm categories such as bias, insult, identity attack, and microaggression\. PolygloToxicityPrompts \(PTP\) \(Jain et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib67)\) offers 425K naturally sourced prompts across 17 languages and reports that toxicity tends to increase as model size grows or language\-resource availability decreases\. FrenchToxicityPrompts \(Brun and Nikoulina, [2024](https://arxiv.org/html/2606.25380#bib.bib98)\) provides 50K French prompts\. TET \(Luong et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib9)\) comprises 2,546 prompts filtered from 1M real\-world LLM interactions to expose latent toxic behaviors that can bypass safety mechanisms\. Deshpande et al\. \([2023](https://arxiv.org/html/2606.25380#bib.bib65)\) further showed that persona\-based system prompts can amplify toxic degeneration\.

Table 1: Taxonomy of toxicity datasets organized by task\. Source indicates whether data is human\-authored \(Natural\), machine\-translated \(Translated\), human\-transcreated \(Transcreated\), LLM\-generated \(Synthetic\), or assembled from multiple sources \(Mixed\)\. Sizes are rounded where appropriate and follow the cited paper or task release\.

### 3\.2 Metrics

Toxicity Detection Metrics\. Outputs are often scored by toxicity classifiers\. A common metric is style transfer accuracy \(STA\): the fraction of outputs that a classifier deems non\-toxic\. For example, prior work uses RoBERTa\-based classifiers trained on Jigsaw data to compute STA \(Dementieva et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib24)\)\. Other tools such as the Perspective API \([https://perspectiveapi\.com/](https://perspectiveapi.com/)\) provide continuous toxicity scores\.
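As a minimal sketch of the STA definition above, the snippet below turns per-output classifier scores into the fraction judged non-toxic. The function name, the 0.5 decision threshold, and the scores are hypothetical; a real evaluation would take the scores from whichever classifier the benchmark specifies.

```python
def style_transfer_accuracy(toxicity_scores, threshold=0.5):
    """Fraction of outputs whose toxicity score falls below the decision threshold."""
    if not toxicity_scores:
        raise ValueError("need at least one scored output")
    non_toxic = sum(1 for score in toxicity_scores if score < threshold)
    return non_toxic / len(toxicity_scores)


# Hypothetical classifier scores for five detoxified outputs.
detox_scores = [0.02, 0.10, 0.61, 0.30, 0.85]
sta = style_transfer_accuracy(detox_scores)
print(f"STA={sta:.2f}")
```

Because STA inherits the classifier's errors, a detector that is weaker in some languages will make STA less trustworthy in exactly those languages.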
Detoxification systems are typically expected to improve STA or reduce toxicity scores while preserving meaning and fluency \(e\.g\., reducing toxic\-generation probability as in Li et al\., [2024b](https://arxiv.org/html/2606.25380#bib.bib68)\)\.

Content Preservation and Fluency\. To ensure meaning is retained, similarity metrics are applied\. Popular choices include BLEURT \(Sellam et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib27)\) or BERTScore \(Zhang et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib136)\) to compare the detoxified output to the input or a reference\. Dementieva et al\. \([2023](https://arxiv.org/html/2606.25380#bib.bib24)\) adopt BLEURT for English content similarity \(SIM\) and LaBSE embeddings for Russian\. Fluency is evaluated by the percentage of grammatical or fluent sentences, often via a language acceptability classifier \(e\.g\., a RoBERTa trained to recognize acceptability\) \(Dementieva et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib24); Logacheva et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib54)\)\. Combined metrics like the product of STA, SIM, and fluency are sometimes used to rank models\.

Cross\-Lingual Alignment\. When detox and translation happen together, one can also measure translation quality or cross\-lingual consistency\. For example, in simultaneous translation\+detox, one may compute BLEU \(Papineni et al\., [2002](https://arxiv.org/html/2606.25380#bib.bib137)\) or COMET \(Rei et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib138)\) between the generated detoxified translation and a human reference\. In practice, cross\-lingual transfer effectiveness is often inferred from zero\-shot performance, or by correlating translated and original outputs\. Some work also uses source\-output embedding similarity as a proxy for semantic alignment\.

Human Evaluation\. Ultimately, manual judgments are key\.
Human annotators typically rate detoxified outputs on \(1\) toxicity/style \(is the output non\-toxic/neutral?\); \(2\) content preservation \(does it retain the original meaning?\); and \(3\) fluency \(is the output natural?\)\. Human scores are used both to evaluate final systems and to calibrate or validate automatic metrics \(e\.g\., correlating BLEURT with meaning preservation\)\.

## 4 Detection

Detecting toxicity in multilingual settings is complicated by linguistic diversity, code\-mixing, dialectal variation, and culturally contingent definitions of harm\.

### 4\.1 Multilingual Transformers

Early toxicity detection relied on keyword lists and lexicon\-based classifiers, which lack contextual understanding and fail under paraphrase, obfuscation, and dialectal variation\. Deep contextual encoders such as mBERT and XLM\-R marked a significant advance, demonstrating that cross\-lingual representations can improve toxicity identification across languages \(Conneau et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib105); Tiţa and Zubiaga, [2021](https://arxiv.org/html/2606.25380#bib.bib111)\)\. These models benefit from shared subword vocabularies and multilingual pretraining, allowing transfer from high\-resource languages such as English to languages with less labeled toxicity data\. Nevertheless, performance remains uneven across scripts, dialects, and languages with limited pretraining resources \(Kanjirangat et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib114)\)\. The brittleness of subword tokenization under spelling variants, obfuscation, and script mixing has motivated byte\- and character\-level alternatives; the newer Perspective API, for example, uses a multilingual token\-free Charformer architecture for toxic\-content detection \(Lees et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib110)\)\.
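The brittleness of exact token matching under obfuscation can be seen in a toy sketch. This is not how Charformer or any learned detector works; it only contrasts a whole-token lexicon lookup with a crude character-level normalization. The one-word lexicon, the substitution map, and both function names are hypothetical.

```python
import re

BLOCKLIST = {"idiot"}  # hypothetical one-word lexicon, for illustration only

# Hypothetical map that undoes a few common character substitutions.
SUBSTITUTIONS = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a", "@": "a", "$": "s"})


def keyword_match(text):
    """Exact whole-token lookup, as in a plain lexicon-based filter."""
    tokens = re.findall(r"\w+", text.lower())
    return any(token in BLOCKLIST for token in tokens)


def char_level_match(text):
    """Undo simple character swaps and drop separators before a substring search."""
    normalized = text.lower().translate(SUBSTITUTIONS)
    squeezed = re.sub(r"[^a-z]", "", normalized)
    return any(term in squeezed for term in BLOCKLIST)


samples = ["you idiot", "you 1d10t", "you i.d.i.o.t"]
results = [(keyword_match(s), char_level_match(s)) for s in samples]
for sample, (exact, char_level) in zip(samples, results):
    print(f"{sample!r}: exact={exact} char_level={char_level}")
```

The character-level variant catches both obfuscated spellings that the exact lookup misses, but the hand-written normalization is ASCII-only and matches across word boundaries, so it trades missed detections for false positives. That trade-off is one reason learned byte- or character-level models are attractive for multilingual and mixed-script input.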
### 4\.2 Translation\-Based Pipelines

A parallel line of work explores translation\-based pipelines, where non\-English text is machine\-translated into English before being passed to an English toxicity classifier \(Bell et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib109)\)\. This strategy can be competitive because English detectors are comparatively mature, but it introduces error propagation, translation artifacts, and semantic drift, especially for dialectal, code\-mixed, or morphologically complex inputs \(Zampieri et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib115)\)\. Translation systems themselves can introduce or obscure toxic content, so translation is best treated as an evaluation or deployment design choice rather than a neutral preprocessing step \(Costa\-Jussà et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib39)\)\.

### 4\.3 Representation\-Level Detection

Recent research identifies linear toxic subspaces in language model embeddings \(Wang et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib103); Duan et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib102)\), suggesting that toxicity\-related features can occupy identifiable directions in latent space\. Decomposing models into interpretable expert components can further isolate toxicity\-related behavior \(Shaik et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib104)\)\. These findings motivate probing and attribution techniques that seek to locate where toxicity features are stored, with applications to both detection and mitigation \(Wang et al\., [2024a](https://arxiv.org/html/2606.25380#bib.bib55); Goyal et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib130)\)\.

### 4\.4 LLM\-Based Detection

The emergence of instruction\-tuned LLMs has opened new detection avenues \(Hu et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib108)\)\.
Several works evaluate LLMs as zero\-shot or few\-shot toxicity detectors, showing strong generalization but also calibration failures \(Liu et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib116)\) and cultural misalignment across languages \(Yang et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib18)\)\. These models often rely on implicit safety priors learned during alignment, which can produce inconsistent behavior on region\-specific sociolinguistic norms\.

Takeaway: Multilingual LLMs and multilingual encoders have improved cross\-lingual toxicity detection, but substantial challenges remain\. Persistent gaps in language coverage, bias in training corpora, inconsistent cross\-lingual performance, and translation\-induced errors limit the reliability of current detectors\. See Table [2](https://arxiv.org/html/2606.25380#A1.T2) in the Appendix for a detailed comparison\.

## 5 Detoxification

### 5\.1 Data\-Centric Detoxification

Data\-centric detoxification targets the quality of pre\-training and fine\-tuning corpora by removing or down\-weighting toxic content\. Early filtering pipelines relied on blocklists or lexical heuristics; contemporary pipelines often combine language identification, quality filters, and toxicity classifiers at web scale \(Kreutzer et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib50); Stranisci and Hardmeier, [2025](https://arxiv.org/html/2606.25380#bib.bib122)\)\. More recent work emphasizes bias\-aware filtering to avoid suppressing dialectal or marginalized speech \(Sap et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib125); Jaggi et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib69); Xu et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib72)\)\.
In multilingual settings, filtering depends heavily on cross\-lingual detector generalization, which can misclassify culturally specific idioms, reclaimed slurs, or dialectal markers \(Bensalem et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib52); Welbl et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib124)\)\. Data filtering is scalable and can reduce exposure to toxic training examples, but it also risks cultural misalignment, uneven language coverage, and the over\-removal of minority language varieties; recent work argues that harmful\-content filtering can deepen underrepresentation of already vulnerable groups \(Stranisci and Hardmeier, [2025](https://arxiv.org/html/2606.25380#bib.bib122)\)\.

### 5\.2 Model\-Centric Detoxification

#### Supervised Finetuning on Safe or Contrastive Data

Supervised detoxification approaches fine\-tune LLMs on curated non\-toxic corpora, contrastive toxic–neutral pairs, or attribute\-controlled toxicity objectives \(Hawkins et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib117); Meng et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib92)\)\. Neplenbroek et al\. \([2025](https://arxiv.org/html/2606.25380#bib.bib84)\) report that mitigation can transfer across languages, but that transfer depends on language\-resource conditions and can trade off against non\-English generation quality\. Fine\-tuning\-based detoxification can provide strong control, but it may reduce output diversity, degrade generation quality, or introduce stylistic flattening \(Wang et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib123); Welbl et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib124)\)\.

#### Instruction\-Based Safety Tuning

Instruction tuning using curated safety data or synthetic refusal\-style instructions can enhance multilingual LLMs’ ability to decline harmful requests and avoid toxic continuations\.
Multilingual preference optimization shows that alignment can transfer across languages when feedback data are balanced and sufficiently broad \(Dang et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib119)\)\. These methods scale well for deployment, though annotation biases and cultural coverage remain persistent limitations\.

#### RLHF and Human Feedback Alignment

Reinforcement learning from human feedback \(RLHF\) \(Ouyang et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib34); Bai et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib49)\) can improve safety by training reward models to penalize toxic outputs\. While RLHF datasets are primarily English\-centric, multilingual LLMs can benefit indirectly through shared parameters and cross\-lingual transfer \(Dang et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib119)\)\. However, reliance on English safety norms introduces cross\-cultural misalignment in multilingual models \(Lu et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib51)\), especially for expressions that are offensive in some cultures but neutral in others\.

### 5\.3 Decoding\-Time Detoxification

Post\-hoc methods avoid or minimize retraining by steering generation at inference \(Ko et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib129)\)\. Classifier\-guided and expert\-based logit steering include PPLM hidden\-state perturbations \(Dathathri et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib131); Pascual et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib70)\), GeDi\-style generative discriminators \(Krause et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib120)\), and expert/anti\-expert mixture decoding such as DExperts \(Liu et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib71)\)\. Expert steering is modular, but high\-quality multilingual experts are a bottleneck\.
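Expert/anti-expert decoding can be sketched at the level of a single decoding step. The snippet follows the commonly stated DExperts combination, where the next-token logits of the base model are shifted by a scaled difference between a non-toxic expert and a toxic anti-expert before the softmax. The three-token vocabulary, the logit values, and the function names are hypothetical, and real systems apply this over a full vocabulary with actual language models.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(base, expert, anti_expert, alpha=1.0):
    """Shift base logits toward the expert and away from the anti-expert."""
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Hypothetical 3-token vocabulary; index 2 stands in for a toxic token.
base = [1.0, 1.0, 2.0]
expert = [1.0, 1.0, 0.0]       # non-toxic expert disfavors the toxic token
anti_expert = [0.0, 0.0, 3.0]  # toxic anti-expert strongly favors it

steered = steer_logits(base, expert, anti_expert, alpha=1.0)
p_before = softmax(base)
p_after = softmax(steered)
print(f"P(toxic token): {p_before[2]:.3f} -> {p_after[2]:.3f}")
```

The strength parameter `alpha` controls the trade-off between toxicity reduction and fidelity to the base model. In a multilingual setting the expert and anti-expert would need to be reliable in the target language, which is the bottleneck noted above.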
A second family uses *edit\-after\-generate*: produce a candidate, detect toxicity, and rewrite or refine it via prompting or a specialized editor \(Leong et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib80)\)\. In multilingual deployments, *translation\-pivot pipelines* \(translate → detox in English → translate back\) remain common, but they risk semantic drift and can erase culturally salient pragmatics \(Dementieva et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib24); Bell et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib109)\)\. Retrieval augmentation can also support detoxification by grounding rewrites in policy examples or safe templates \(Pozzobon et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib74)\)\.

### 5\.4 Model Editing and Representation Interventions

Recent work investigates activation steering: modifying internal LM representations to remove or attenuate toxic features \(Goyal et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib130)\)\. Activation Addition \(Turner et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib31)\) and ROME\-based editing \(Meng et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib32)\) identify directions or associations that can be altered during generation\. Early analyses of how interventions reshape cross\-lingual representations \(Sundar et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib121)\) suggest potential for multilingual transfer, though evaluation is still nascent and regression risk remains high without careful cross\-lingual audits \(Wang et al\., [2024a](https://arxiv.org/html/2606.25380#bib.bib55)\)\.
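One simple form of representation intervention can be sketched on toy vectors. The snippet estimates a "toxicity direction" as the difference between mean activations on toxic and neutral inputs, then removes a scaled projection of a hidden state onto that direction. The difference-of-means estimate, the three-dimensional activations, and all function names are assumptions made for illustration; the cited methods differ in how they locate and apply directions (Activation Addition, for instance, adds a steering vector rather than projecting one out).

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def unit(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def attenuate(hidden, direction, strength=1.0):
    """Subtract strength times the projection of hidden onto the given direction."""
    d = unit(direction)
    projection = sum(h * x for h, x in zip(hidden, d))
    return [h - strength * projection * x for h, x in zip(hidden, d)]


# Hypothetical 3-dimensional activations collected on toxic and neutral inputs.
toxic_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neutral_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = [t - n for t, n in zip(mean_vector(toxic_acts), mean_vector(neutral_acts))]
steered = attenuate([3.0, 2.0, 1.0], direction, strength=1.0)
print(steered)
```

With `strength=1.0` the component along the estimated direction is removed entirely, while smaller values only attenuate it. Whether a direction estimated in one language also captures toxicity in another is the open cross-lingual question raised above, so any such edit would need per-language auditing.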
### 5\.5 Multilingual Guardrails

A related line of work \(not the main focus of this survey\) is post\-hoc moderation via multilingual guardrails \(Yi et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib10)\): deployment\-time controllers that classify and gate prompts and responses into policy categories such as prompt harmfulness, response harmfulness, and refusal/compliance under adversarial multilingual inputs\. Language choice, code\-switching, and transliteration can weaken English\-centric safeguards\. Representative guardrails and safety classifiers include Llama Guard \(Inan et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib13)\), Aegis \(Ghosh et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib14)\), MrGuard \(Yang et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib18)\), WildGuard \(Han et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib16)\), PolyGuard \(Kumar et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib12)\), MultiGuard/OmniGuard \(Verma et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib11)\), CREST \(Bansal and Mishra, [2025](https://arxiv.org/html/2606.25380#bib.bib17)\), Qwen3Guard \(Zhao et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib132)\), and UnityAI\-Guard \(Beniwal et al\., [2025b](https://arxiv.org/html/2606.25380#bib.bib66)\)\.

Key Takeaways\.

- Cross\-lingual robustness remains a central challenge: Detoxification methods often perform better in high\-resource languages than in low\-resource or morphologically rich languages\.
- Cultural bias persists across detoxification pipelines: Much safety supervision originates from English, creating misalignment in non\-Western contexts\.
- Hybrid strategies are promising: Combining data filtering, controlled decoding, alignment tuning, and guardrails can cover failure modes that no single method handles reliably\.
- Avoiding over\-censorship is an unresolved issue: Techniques often suppress legitimate emotional or dialectal expressions, leading to “model homogenization\.”

See Table [3](https://arxiv.org/html/2606.25380#A1.T3) in the Appendix for a detailed comparison of detoxification techniques\.

## 6 Discussion and Open Challenges

### 6\.1 Cross\-Lingual Gaps in Detoxification

A persistent disparity exists between high\-resource and low\-resource languages\. Many multilingual toxicity detectors and safety\-tuned LLMs are trained or validated primarily on English and other high\-resource languages, leaving morphologically rich, dialectal, or culturally distant varieties under\-detected \(Shen et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib8); Wang et al\., [2024b](https://arxiv.org/html/2606.25380#bib.bib127); Bensalem et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib52)\)\. Alignment methods such as RLHF and constitutional tuning have historically relied on English\-heavy preference or principle data \(Ouyang et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib34); Bai et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib49)\), which can produce inconsistent refusal behavior and weak recognition of non\-English toxic slang \(Lu et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib51); Dang et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib119)\)\.

Open Challenge: Developing culturally aware multilingual safety representations that scale to low\-resource languages without English over\-dominance remains essential\.

### 6\.2 Cultural and Normative Misalignment

Toxicity is culturally embedded: annotators’ identities and beliefs strongly influence judgments \(Sap et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib125), [2019](https://arxiv.org/html/2606.25380#bib.bib30); Jaggi et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib69)\), yet many safety datasets collapse disagreement into a single label\.
Models therefore risk over\-censoring reclaimed slurs, misclassifying dialectal expressions, or reinforcing majority\-group norms \(Shen et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib8)\)\. Languages with rich honorific systems, code\-switching norms, or culturally specific humor \(Li et al\., [2024a](https://arxiv.org/html/2606.25380#bib.bib107)\) expose current models’ limited ability to differentiate toxicity from socially sanctioned expression\.

Open Challenge: Future systems need culturally grounded, community\-driven annotation and context\-aware toxicity modeling that respects sociolinguistic diversity\.

### 6\.3 Lack of Robust, Multilingual Evaluation Frameworks

A recurring theme is the lack of standardized, multilingual frameworks for evaluating toxicity\. Existing generation benchmarks such as RealToxicityPrompts \(Gehman et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib58)\) are English\-only, while newer multilingual datasets such as RTP\-LX, PTP, and PolyGuard broaden coverage but differ substantially in task format, label schema, and language set \(de Wynter et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib97); Jain et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib67); Kumar et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib12)\)\. Evaluation pipelines also struggle with subtle harms such as microaggressions, presuppositional harm, and implicit bias \(Sap et al\., [2022](https://arxiv.org/html/2606.25380#bib.bib125)\)\. Cross\-lingual transfer of toxicity classifiers can produce false positives for dialects or false negatives for low\-resource slang, making direct comparison unreliable\.

Open Challenge: The field needs multilingual benchmarks with fine\-grained toxicity categories, cross\-cultural annotations, and shared evaluation protocols \(Wang et al\., [2024b](https://arxiv.org/html/2606.25380#bib.bib127)\)\.
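
One simple way to surface the dialect false positives and low-resource false negatives described above is to report error rates per language or variety rather than a single pooled score. The sketch below is a minimal illustration under assumed inputs: the record format, the language tags, and the toy predictions are all hypothetical and do not come from any of the benchmarks cited here.

```python
from collections import defaultdict


def per_language_error_rates(records):
    """Compute false-positive and false-negative rates per language.

    `records` is an iterable of (language, gold_is_toxic, predicted_is_toxic).
    A rate is None when the language has no examples of the relevant class.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for lang, gold, pred in records:
        c = counts[lang]
        if gold:
            c["pos"] += 1
            if not pred:
                c["fn"] += 1  # toxic text the detector missed
        else:
            c["neg"] += 1
            if pred:
                c["fp"] += 1  # benign text flagged as toxic
    return {
        lang: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for lang, c in counts.items()
    }


# Hypothetical predictions for English and code-mixed Hindi-English text.
records = [
    ("en", True, True),
    ("en", True, True),
    ("en", False, False),
    ("en", False, False),
    ("hi-en", True, False),
    ("hi-en", True, True),
    ("hi-en", False, True),
    ("hi-en", False, False),
]
rates = per_language_error_rates(records)
```

Reporting both rates side by side makes it harder for a strong pooled score to hide over-flagging of benign dialectal text or under-detection in a low-resource language.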
### 6\.4 Over\-Suppression and Style Degradation

Detoxification techniques, particularly contrastive finetuning and representation editing, can reduce linguistic richness or stylistic diversity\. Prior work shows that detoxification can trade toxicity reduction for reduced fluency, reduced diversity, or suppression of identity\-related language \(Welbl et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib124); Liu et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib71); Xu et al\., [2021](https://arxiv.org/html/2606.25380#bib.bib72)\)\. In multilingual settings, this risk is amplified: low\-resource languages may be pushed toward generic, formal, or English\-like outputs because the model has weaker language\-specific representations\. Techniques such as activation editing \(Turner et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib31)\) and PPLM \(Dathathri et al\., [2020](https://arxiv.org/html/2606.25380#bib.bib131)\) offer fine\-grained control but still risk semantic over\-suppression when applied cross\-lingually\.

Open Challenge: Designing detoxification techniques that preserve stylistic and cultural characteristics while eliminating harmful content remains an open frontier\.

### 6\.5 Handling Code\-Switching and Mixed\-Linguistic Toxicity

Multilingual communities frequently communicate through code\-switching \(e\.g\., Hinglish, Arabizi, Spanglish\), combining scripts, phonetic spellings, and culturally specific expressions\. Current LLMs and toxicity detectors are less reliable under code\-switching because training coverage, tokenization, and evaluation data are sparse for mixed\-language inputs \(Zhang et al\., [2023](https://arxiv.org/html/2606.25380#bib.bib128); Bensalem et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib52)\)\.
Safety failures under code\-switched or transliterated prompts have been demonstrated for red\-teaming and jailbreak settings \(Al Ghanim et al\., [2024](https://arxiv.org/html/2606.25380#bib.bib20); Yoo et al\., [2025](https://arxiv.org/html/2606.25380#bib.bib21)\), and Hindi\-English toxic language remains an active detection problem \(Sharma and Bhalla, [2025](https://arxiv.org/html/2606.25380#bib.bib126)\)\. This poses serious risks for global deployments of multilingual LLMs\.

Open Challenge: Robust multilingual safety systems must explicitly account for code\-switching and orthographic variation via code\-mixed training corpora, unified mixed\-script tokenizers, and transliteration\-aware detection\.

### 6\.6 The Role of Decoding\-Time Steering

Decoding\-time methods \(PPLM, GeDi, DExperts\) are best viewed as a *complementary* safeguard \(modular and retraining\-light\), not a standalone fix for root causes like English\-centric alignment data\. In multilingual settings, tokenization asymmetries, script mixing, and cross\-lingual semantic drift weaken expert model reliability; building language\-specific experts for low\-resource languages remains impractical at scale\. A central bottleneck is expert availability \(data scarcity\), followed by representation entanglement \(toxic directions conflating sentiment, intensity, and identity\) and cross\-script transfer instability\. The evidence points toward hybrid architectures: filtering and alignment tuning address root causes; steering provides inference\-time control; guardrails add system\-level robustness\. Where norms are culturally contingent, community\-grounded supervision remains necessary\.

### 6\.7 Key Takeaways

- Language disparities: Methods effective in English often underperform in low\-resource languages and dialects\.
- Cultural context: One\-size\-fits\-all safety tuning misaligns with local norms, over\-censoring benign expressions or missing contextually offensive language\.
- Evaluation gaps: Fragmented protocols and English\-centric benchmarks make cross\-system comparison unreliable, especially for subtle toxicity\.
- Style trade\-offs: Detoxification often degrades output diversity, yielding generic text that erases linguistic richness\.
- Hybrid approaches: Combining data filtering, controlled generation, culturally aware alignment, and guardrails is the most defensible direction for deployment\.
- Interpretability: Understanding why models flag or generate toxic content is essential for trust and auditability in multilingual settings\.

## 7 Conclusion

This survey offers a focused treatment of detoxification for multilingual LLMs, a problem that remains under\-studied relative to its practical importance\. We systematized the space along three axes: multilingual threat models that expose how language shift, translation pivots, code\-switching, and post\-deployment adaptation erode safety; task formulations spanning rewriting, classification, and toxic\-generation evaluation; and a mechanism\-based taxonomy covering data filtering, supervised and preference\-based tuning, decoding\-time steering, representation editing, and guardrails\. Two findings cut across every axis\. First, cross\-lingual transfer of safety is unreliable: methods effective in English routinely under\-perform in low\-resource and morphologically rich languages, and alignment learned from English preference data can misfire when projected onto other cultural contexts\. Second, detoxification and linguistic diversity are in tension: current techniques can suppress legitimate dialectal, code\-switched, or identity\-related expression, trading one harm for another\. The most pressing research need is evaluation infrastructure: standardized, culturally grounded multilingual benchmarks that go beyond English\-translated prompts and that measure not only toxicity reduction but also preservation of stylistic and cultural content\.
Without such benchmarks, progress on multilingual safety will remain difficult to measure and easy to overstate\.

## Limitations

This survey synthesizes a fast\-moving literature, so specific model families, benchmarks, and best practices may evolve after publication\. Its scope is also intentionally focused on text\-based toxicity detection and detoxification for multilingual language models; we do not cover multimodal moderation, broader cyber\-safety policies, or legal governance in depth\. The evidence base is uneven across languages: many “multilingual” studies still emphasize English and other high\-resource languages, with fewer results for low\-resource languages, dialect continua, and code\-mixed or transliterated text\. Because toxicity definitions and label schemas vary across datasets and cultures, comparisons across papers are necessarily approximate\. We also do not run a quantitative meta\-analysis or reproduce prior experiments; our synthesis depends on reported results, which often use different models, datasets, detectors, and evaluation protocols\. Finally, many evaluations rely on automatic detectors, translation\-based protocols, or closed\-model assessments, which can introduce measurement noise and limit strict apples\-to\-apples replication\.

## Ethics

This survey reviews prior work on toxicity in multilingual language models and does not involve new data collection, human\-subject annotation, or model deployment\. Because the paper discusses jailbreaks, red\-teaming, and adaptation\-time safety failures, the topic has some dual\-use risk\. We therefore keep the discussion at the level of threat models, evaluation categories, and mitigation strategies rather than providing operational attack instructions or harmful prompt examples\.
The central ethical concern is that automated toxicity detection and detoxification can reflect English\-centric or majority\-culture norms, misclassify reclaimed or dialectal expressions, and suppress legitimate identity\-related speech\. Such failures can reinforce societal and annotator biases or lead to over\-censorship, especially for communities already underrepresented in training and evaluation data\. We therefore emphasize culturally grounded evaluation, inclusive data practices, transparent reporting of language coverage, and careful safety–utility trade\-offs in multilingual deployment\.

## References

- Fairness and robustness in invariant learning: a case study in toxicity classification\. arXiv preprint arXiv:2011\.06485\. External Links: 2011\.06485, [Link](https://arxiv.org/abs/2011.06485), [Document](https://dx.doi.org/10.48550/arXiv.2011.06485)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1)\.
- M\. Al Ghanim, S\. Almohaimeed, M\. Zheng, Y\. Solihin, and Q\. Lou \(2024\) Jailbreaking llms with arabic transliteration and arabizi\. In Proceedings of the 2024 conference on empirical methods in natural language processing, Miami, Florida, USA, pp\. 18584–18600\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.1034/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1034)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1), [§6\.5](https://arxiv.org/html/2606.25380#S6.SS5.p1.1)\.
- K\. Atwell, S\. Hassan, and M\. Alikhani \(2022\) APPDIA: a discourse\-aware transformer\-based style transfer model for offensive social media conversations\. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, pp\. 6063–6074\. External Links: [Link](https://aclanthology.org/2022.coling-1.530/)\. Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.1.1.1.1.1.1.1.1.2)\.
- Y\. Bai, S\. Kadavath, S\. Kundu, A\.
Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon, C\. Chen, C\. Olsson, C\. Olah, D\. Hernandez, D\. Drain, D\. Ganguli, D\. Li, E\. Tran\-Johnson, E\. Perez, J\. Kerr, J\. Mueller, J\. Ladish, J\. Landau, K\. Ndousse, K\. Lukosiute, L\. Lovitt, M\. Sellitto, N\. Elhage, N\. Schiefer, N\. Mercado, N\. DasSarma, R\. Lasenby, R\. Larson, S\. Ringer, S\. Johnston, S\. Kravec, S\. El Showk, S\. Fort, T\. Lanham, T\. Telleen\-Lawton, T\. Conerly, T\. Henighan, T\. Hume, S\. R\. Bowman, Z\. Hatfield\-Dodds, B\. Mann, D\. Amodei, N\. Joseph, S\. McCandlish, T\. Brown, and J\. Kaplan \(2022\) Constitutional ai: harmlessness from ai feedback\. External Links: 2212\.08073, [Link](https://arxiv.org/abs/2212.08073)\. Cited by: [§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1)\.
- L\. Bansal and N\. Mishra \(2025\) CREST: universal safety guardrails through cluster\-guided cross\-lingual transfer\. arXiv preprint arXiv:2512\.02711\. Note: Accepted at LREC 2026\. External Links: 2512\.02711, [Link](https://arxiv.org/abs/2512.02711), [Document](https://dx.doi.org/10.48550/arXiv.2512.02711)\. Cited by: [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- V\. Basile, C\. Bosco, E\. Fersini, D\. Nozza, V\. Patti, F\. M\. R\. Pardo, P\. Rosso, and M\. Sanguinetti \(2019\) Semeval\-2019 task 5: multilingual detection of hate speech against immigrants and women in twitter\. In Proceedings of the 13th international workshop on semantic evaluation, Minneapolis, Minnesota, USA, pp\. 54–63\. External Links: [Link](https://aclanthology.org/S19-2007/), [Document](https://dx.doi.org/10.18653/v1/S19-2007)\. Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1)\.
- S\. Bell, E\. Sánchez, D\. Dale, P\. Stenetorp, M\. Artetxe, and M\. R\.
Costa\-jussà \(2025\) Translate, then detect: leveraging machine translation for cross\-lingual toxicity classification\. In Proceedings of the Tenth Conference on Machine Translation, Suzhou, China, pp\. 253–268\. External Links: [Link](https://aclanthology.org/2025.wmt-1.15/), [Document](https://dx.doi.org/10.18653/v1/2025.wmt-1.15)\. Cited by: [Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.3.2.5.1.1), [§4\.2](https://arxiv.org/html/2606.25380#S4.SS2.p1.1), [§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\.
- H\. Beniwal, Y\. Kim, M\. Sap, S\. Dan, and T\. Hartvigsen \(2025a\) Breaking mbad\! supervised fine\-tuning for cross\-lingual detoxification\. External Links: 2505\.16722, [Link](https://arxiv.org/abs/2505.16722)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.p1.1)\.
- H\. Beniwal, R\. Venkat, R\. Kumar, B\. Srivibhav, D\. Jain, P\. Doddi, E\. Dhande, A\. Ananth, Kuldeep, and M\. Singh \(2025b\) UNITYAI\-guard: pioneering toxicity detection across low\-resource indian languages\. External Links: 2503\.23088, [Link](https://arxiv.org/abs/2503.23088)\. Cited by: [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- I\. Bensalem, P\. Rosso, and H\. Zitouni \(2024\) Toxic language detection: a systematic review of arabic datasets\. External Links: 2312\.07228, [Link](https://arxiv.org/abs/2312.07228)\. Cited by: [§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1), [§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1), [§6\.5](https://arxiv.org/html/2606.25380#S6.SS5.p1.1)\.
- C\. Brun and V\. Nikoulina \(2024\) FrenchToxicityPrompts: a large benchmark for evaluating and mitigating toxicity in french texts\. In Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC\-COLING\-2024, pp\. 105–114\. External Links: [Link](https://aclanthology.org/2024.trac-1.12/)\. Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p4.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.15.12.1)\.
- D\. Cecchini, A\. Nazir, K\. Chakravarthy, and V\.
Kocaman \(2024\) Holistic evaluation of large language models: assessing robustness, accuracy, and toxicity for real\-world applications\. In Proceedings of the 4th Workshop on Trustworthy Natural Language Processing \(TrustNLP 2024\), Mexico City, Mexico, pp\. 109–117\. External Links: [Link](https://aclanthology.org/2024.trustnlp-1.11/), [Document](https://dx.doi.org/10.18653/v1/2024.trustnlp-1.11)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1)\.
- A\. Conneau, K\. Khandelwal, N\. Goyal, V\. Chaudhary, G\. Wenzek, F\. Guzmán, E\. Grave, M\. Ott, L\. Zettlemoyer, and V\. Stoyanov \(2020\) Unsupervised cross\-lingual representation learning at scale\. In Proceedings of the 58th annual meeting of the association for computational linguistics, Online, pp\. 8440–8451\. External Links: [Link](https://aclanthology.org/2020.acl-main.747/), [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.747)\. Cited by: [§4\.1](https://arxiv.org/html/2606.25380#S4.SS1.p1.1)\.
- M\. R\. Costa\-Jussà, E\. Smith, C\. Ropers, D\. Licht, J\. Maillard, J\. Ferrando, and C\. Escolano \(2023\) Toxicity in multilingual machine translation at scale\. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp\. 9570–9586\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.642/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.642)\. Cited by: [Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.3.2.5.1.1), [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1), [§4\.2](https://arxiv.org/html/2606.25380#S4.SS2.p1.1)\.
- P\. P\. S\. Dammu, H\. Jung, A\. Singh, M\. Choudhury, and T\. Mitra \(2024\) “They are uncultured”: unveiling covert harms and social threats in llm generated conversations\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp\.
20339–20369\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.1134/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1134)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1)\.
- J\. Dang, A\. Ahmadian, K\. Marchisio, J\. Kreutzer, A\. Üstün, and S\. Hooker \(2024\) RLHF can speak many languages: unlocking multilingual preference optimization for LLMs\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 13134–13156\. External Links: [Link](https://aclanthology.org/2024.emnlp-main.729/), [Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.729)\. Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.3.2.4.1.1), [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1), [§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px2.p1.1), [§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px3.p1.1), [§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1)\.
- S\. Dathathri, A\. Madotto, J\. Lan, J\. Hung, E\. Frank, P\. Molino, J\. Yosinski, and R\. Liu \(2020\) Plug and play language models: a simple approach to controlled text generation\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=H1edEyBKDS)\. Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.4.3.4.1.1), [§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2), [§6\.4](https://arxiv.org/html/2606.25380#S6.SS4.p1.1)\.
- A\. de Wynter, I\. Watts, T\. Wongsangaroonsri, M\. Zhang, N\. Farra, N\. E\. Altıntoprak, L\. Baur, S\. Claudet, P\. Gajdušek, Q\. Gu, A\. Kaminska, T\. Kaminski, R\. Kuo, A\. Kyuba, J\. Lee, K\. Mathur, P\. Merok, I\. Milovanović, N\. Paananen, V\. Paananen, A\. Pavlenko, B\. P\. Vidal, L\. I\. Strika, Y\. Tsao, D\. Turcato, O\. Vakhno, J\. Velcsov, A\. Vickers, S\. F\. Visser, H\. Widarmanto, A\. Zaikin, and S\.
Chen \(2025\) RTP\-lx: can llms evaluate toxicity in multilingual scenarios?\. Proceedings of the AAAI Conference on Artificial Intelligence 39\(27\), pp\. 27940–27950\. External Links: [Link](https://ojs.aaai.org/index.php/AAAI/article/view/35011), [Document](https://dx.doi.org/10.1609/aaai.v39i27.35011)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.p1.1), [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p4.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.13.10.1), [§6\.3](https://arxiv.org/html/2606.25380#S6.SS3.p1.1)\.
- D\. Dementieva, N\. Babakov, and A\. Panchenko \(2024a\) MultiParaDetox: extending text detoxification with parallel data to new languages\. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\), K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\), Mexico City, Mexico, pp\. 124–140\. External Links: [Link](https://aclanthology.org/2024.naacl-short.12/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-short.12)\. Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.2.1.4.1.1), [§1](https://arxiv.org/html/2606.25380#S1.p1.1), [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.5.2.1)\.
- D\. Dementieva, N\. Babakov, A\. Ronen, A\. A\. Ayele, N\. Rizwan, F\. Schneider, X\. Wang, S\. M\. Yimam, D\. A\. Moskovskiy, E\. Stakovskii, E\. Kaufman, A\. Elnagar, A\. Mukherjee, and A\. Panchenko \(2025\) Multilingual and explainable text detoxification with parallel corpora\. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, UAE, pp\. 7998–8025\. Note: COLING 2025\. External Links: [Link](https://aclanthology.org/2025.coling-main.535/)\. Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.2.1.4.1.1), [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1)\.
- D\. Dementieva, D\. Moskovskiy, N\. Babakov, A\. A\. Ayele, N\.
Rizwan, F\. Schneider, X\. Wang, S\. M\. Yimam, D\. Ustalov, E\. Stakovskii, A\. Smirnova, A\. Elnagar, A\. Mukherjee, and A\. Panchenko \(2024b\) Overview of the multilingual text detoxification task at pan 2024\. CEUR Workshop Proceedings 3740, pp\. 2432–2461\. External Links: [Link](https://nchr.elsevierpure.com/en/publications/overview-of-the-multilingual-text-detoxification-task-at-pan-2024/)\. Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.7.4.1)\.
- D\. Dementieva, D\. Moskovskiy, D\. Dale, and A\. Panchenko \(2023\) Exploring methods for cross\-lingual text style transfer: the case of text detoxification\. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia\-Pacific Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), J\. C\. Park, Y\. Arase, B\. Hu, W\. Lu, D\. Wijaya, A\. Purwarianti, and A\. A\. Krisnadhi \(Eds\.\), Nusa Dua, Bali, pp\. 1083–1101\. External Links: [Link](https://aclanthology.org/2023.ijcnlp-main.70/), [Document](https://dx.doi.org/10.18653/v1/2023.ijcnlp-main.70)\. Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1), [§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p1.1), [§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p2.1), [§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\.
- Y\. Deng, W\. Zhang, S\. J\. Pan, and L\. Bing \(2024\) Multilingual jailbreak challenges in large language models\. In The Twelfth International Conference on Learning Representations\. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2024/hash/6b396f766a50e0853a5164e68048540c-Abstract-Conference.html)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1), [§2\.1](https://arxiv.org/html/2606.25380#S2.SS1.SSS0.Px1.p1.1)\.
- A\. Deshpande, V\. Murahari, T\. Rajpurohit, A\. Kalyan, and K\.
Narasimhan \(2023\) Toxicity in chatgpt: analyzing persona\-assigned language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp\. 1236–1270\. External Links: [Link](https://aclanthology.org/2023.findings-emnlp.88/), [Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.88)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.p1.1), [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p4.1)\.
- Z\. Duan, Z\. Yin, Z\. Shi, L\. Pang, S\. Jing, J\. Wu, Y\. Yan, H\. Shen, and X\. Cheng \(2025\) GloSS over toxicity: understanding and mitigating toxicity in llms via global toxic subspace\. arXiv preprint arXiv:2505\.17078\. External Links: 2505\.17078, [Link](https://arxiv.org/abs/2505.17078), [Document](https://dx.doi.org/10.48550/arXiv.2505.17078)\. Cited by: [Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.5.4.5.1.1), [§4\.3](https://arxiv.org/html/2606.25380#S4.SS3.p1.1)\.
- S\. Gehman, S\. Gururangan, M\. Sap, Y\. Choi, and N\. A\. Smith \(2020\) RealToxicityPrompts: evaluating neural toxic degeneration in language models\. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp\. 3356–3369\. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.301), [Link](https://aclanthology.org/2020.findings-emnlp.301)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1), [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p4.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.12.9.1), [§6\.3](https://arxiv.org/html/2606.25380#S6.SS3.p1.1)\.
- S\. Ghosh, P\. Varshney, E\. Galinkin, and C\. Parisien \(2024\) AEGIS: online adaptive ai content safety moderation with ensemble of llm experts\. External Links: 2404\.05993, [Link](https://arxiv.org/abs/2404.05993)\. Cited by: [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- A\. Goyal, V\. Rathi, W\. Yeh, Y\. Wang, Y\. Chen, and H\.
Sundaram \(2025\) Breaking bad tokens: detoxification of LLMs using sparse autoencoders\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 12691–12709\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.641/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.641), ISBN 979\-8\-89176\-332\-6\. Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.6.5.4.1.1), [§4\.3](https://arxiv.org/html/2606.25380#S4.SS3.p1.1), [§5\.4](https://arxiv.org/html/2606.25380#S5.SS4.p1.1)\.
- S\. Han, K\. Rao, A\. Ettinger, L\. Jiang, B\. Y\. Lin, N\. Lambert, Y\. Choi, and N\. Dziri \(2024\) Wildguard: open one\-stop moderation tools for safety risks, jailbreaks, and refusals of llms\. Advances in Neural Information Processing Systems 37, pp\. 8093–8131\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/0f69b4b96a46f284b726fbd70f74fb3b-Abstract-Datasets_and_Benchmarks_Track.html)\. Cited by: [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- T\. Hartvigsen, S\. Gabriel, H\. Palangi, M\. Sap, D\. Ray, and E\. Kamar \(2022\) ToxiGen: a large\-scale machine\-generated dataset for adversarial and implicit hate speech detection\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Dublin, Ireland, pp\. 3309–3326\. External Links: [Link](https://aclanthology.org/2022.acl-long.234/), [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.234)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.p1.1), [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.11.8.1)\.
- W\. Hawkins, B\. Mittelstadt, and C\.
Russell \(2024\) The effect of fine\-tuning on language model toxicity\. arXiv preprint arXiv:2410\.15821\. External Links: 2410\.15821, [Link](https://arxiv.org/abs/2410.15821), [Document](https://dx.doi.org/10.48550/arXiv.2410.15821)\. Cited by: [§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px1.p1.1)\.
- Z\. Hu, J\. Piet, G\. Zhao, J\. Jiao, and D\. Wagner \(2024\) Toxicity detection for free\. Advances in Neural Information Processing Systems 37, pp\. 17518–17540\. External Links: [Link](https://papers.nips.cc/paper_files/paper/2024/hash/1f69928210578f4cf5b538a8c8806798-Abstract-Conference.html)\. Cited by: [Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.4.3.5.1.1), [§4\.4](https://arxiv.org/html/2606.25380#S4.SS4.p1.1)\.
- T\. Huang \(2025\) Content moderation by llm: from accuracy to legitimacy\. Artificial Intelligence Review 58\(10\), pp\. 320\. External Links: [Link](https://link.springer.com/article/10.1007/s10462-025-11328-1), [Document](https://dx.doi.org/10.1007/s10462-025-11328-1)\. Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1)\.
- H\. Inan, K\. Upasani, J\. Chi, R\. Rungta, K\. Iyer, Y\. Mao, M\. Tontchev, Q\. Hu, B\. Fuller, and D\. Testuggine \(2023\) Llama guard: llm\-based input\-output safeguard for human\-ai conversations\. arXiv preprint arXiv:2312\.06674\. External Links: 2312\.06674, [Link](https://arxiv.org/abs/2312.06674), [Document](https://dx.doi.org/10.48550/arXiv.2312.06674)\. Cited by: [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- H\. Jaggi, K\. Coimbatore Murali, E\. Fleisig, and E\. Biyik \(2024\) Accurate and data\-efficient toxicity prediction when annotators disagree\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\.
21910–21917\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1221/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1221)Cited by:[§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1),[§6\.2](https://arxiv.org/html/2606.25380#S6.SS2.p1.1)\. - D\. Jain, P\. Kumar, S\. Gehman, X\. Zhou, T\. Hartvigsen, and M\. Sap \(2024\)PolygloToxicityPrompts: multilingual evaluation of neural toxic degeneration in large language models\.arXiv preprintarXiv:2405\.09373\.Note:May 2024External Links:[Link](https://arxiv.org/abs/2405.09373)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p4.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.14.11.1),[§6\.3](https://arxiv.org/html/2606.25380#S6.SS3.p1.1)\. - Jigsaw \(2018\)Jigsaw toxic comment classification challenge\.Note:Kaggle competitionExternal Links:[Link](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.2.2)\. - V\. Kanjirangat, T\. Samardzic, L\. Dolamic, and F\. Rinaldi \(2025\)Tokenization and representation biases in multilingual models on dialectal nlp tasks\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 23992–24010\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1224/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1224)Cited by:[Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.2.1.5.1.1),[§4\.1](https://arxiv.org/html/2606.25380#S4.SS1.p1.1)\. - M\. Kim, J\. Koo, H\. Lee, J\. Park, H\. Lee, and K\. Jung \(2024\)LifeTox: unveiling implicit toxicity in life advice\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 2: Short Papers\),Mexico City, Mexico,pp\. 
688–698\.External Links:[Link](https://aclanthology.org/2024.naacl-short.60/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-short.60)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.10.7.1)\. - Y\. Kim, H\. Beniwal, S\. L\. Johnson, and T\. Hartvigsen \(2025\)Decoding the rule book: extracting hidden moderation criteria from reddit communities\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Suzhou, China,pp\. 20487–20498\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.1034/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1034)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.25380#S1.p1.1)\. - I\. Kivlichan, J\. Sorensen, J\. Elliott, L\. Vasserman, M\. Görner, and P\. Culliton \(2020\)Jigsaw multilingual toxic comment classification\.Note:Kaggle competitionExternal Links:[Link](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.2.2)\. - C\. Ko, P\. Chen, P\. Das, Y\. Mroueh, S\. Dan, G\. Kollias, S\. Chaudhury, T\. Pedapati, and L\. Daniel \(2024\)Large language models can be strong self\-detoxifiers\.External Links:2410\.03818,[Link](https://arxiv.org/abs/2410.03818)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.5.4.4.1.1),[§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\. - H\. Koh, D\. Kim, M\. Lee, and K\. Jung \(2024\)Can llms recognize toxicity? a structured investigation framework and toxicity metric\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Miami, Florida, USA,pp\. 
6092–6114\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.353),[Link](https://aclanthology.org/2024.findings-emnlp.353)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1)\. - A\. Krasnodębska, M\. Chrabaszcz, and W\. Kusa \(2025\)Rainbow\-teaming for the polish language: a reproducibility study\.InProceedings of the 5th Workshop on Trustworthy NLP \(TrustNLP 2025\),Albuquerque, New Mexico,pp\. 155–165\.External Links:[Link](https://aclanthology.org/2025.trustnlp-main.12/),[Document](https://dx.doi.org/10.18653/v1/2025.trustnlp-main.12)Cited by:[§2\.2](https://arxiv.org/html/2606.25380#S2.SS2.p1.1)\. - A\. Krasnodębska, K\. Dziewulska, K\. Seweryn, M\. Chrabaszcz, and W\. Kusa \(2026\)Safety of large language models beyond English: a systematic literature review of risks, biases, and safeguards\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Rabat, Morocco,pp\. 1003–1034\.External Links:[Link](https://aclanthology.org/2026.eacl-long.44/),[Document](https://dx.doi.org/10.18653/v1/2026.eacl-long.44)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.25380#S1.p1.1)\. - B\. Krause, A\. D\. Gotmare, B\. McCann, N\. S\. Keskar, S\. Joty, R\. Socher, and N\. F\. Rajani \(2021\)Gedi: generative discriminator guided sequence generation\.InFindings of the Association for Computational Linguistics: EMNLP 2021,Punta Cana, Dominican Republic,pp\. 4929–4952\.External Links:[Link](https://aclanthology.org/2021.findings-emnlp.424/),[Document](https://dx.doi.org/10.18653/v1/2021.findings-emnlp.424)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.4.3.4.1.1),[§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\. - J\. Kreutzer, I\. Caswell, L\. Wang, A\. Wahab, D\. van Esch, N\. Ulzii\-Orshikh, A\. Tapo, N\. Subramani, A\. Sokolov, C\. Sikasote, M\. Setyawan, S\. Sarin, S\. Samb, B\. 
Sagot, C\. Rivera, A\. Rios, I\. Papadimitriou, S\. Osei, P\. O\. Suarez, I\. Orife, K\. Ogueji, A\. N\. Rubungo, T\. Q\. Nguyen, M\. Müller, A\. Müller, S\. H\. Muhammad, N\. Muhammad, A\. Mnyakeni, J\. Mirzakhalov, T\. Matangira, C\. Leong, N\. Lawson, S\. Kudugunta, Y\. Jernite, M\. Jenny, O\. Firat, B\. F\. P\. Dossou, S\. Dlamini, N\. de Silva, S\. Çabuk Ballı, S\. Biderman, A\. Battisti, A\. Baruwa, A\. Bapna, P\. Baljekar, I\. A\. Azime, A\. Awokoya, D\. Ataman, O\. Ahia, O\. Ahia, S\. Agrawal, and M\. Adeyemi \(2022\)Quality at a glance: an audit of web\-crawled multilingual datasets\.Transactions of the Association for Computational Linguistics10,pp\. 50–72\.External Links:ISSN 2307\-387X,[Link](http://dx.doi.org/10.1162/tacl_a_00447),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00447)Cited by:[§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1)\. - P\. Kumar, D\. Jain, A\. Yerukola, L\. Jiang, H\. Beniwal, T\. Hartvigsen, and M\. Sap \(2025\)Polyguard: a multilingual safety moderation tool for 17 languages\.arXiv preprint arXiv:2504\.04377\.External Links:2504\.04377,[Link](https://arxiv.org/abs/2504.04377),[Document](https://dx.doi.org/10.48550/arXiv.2504.04377)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.7.6.4.1.1),[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1),[§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1),[§6\.3](https://arxiv.org/html/2606.25380#S6.SS3.p1.1)\. - A\. Lees, V\. Q\. Tran, Y\. Tay, J\. Sorensen, J\. Gupta, D\. Metzler, and L\. Vasserman \(2022\)A new generation of perspective api: efficient multilingual character\-level transformers\.InProceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining,pp\. 3197–3207\.External Links:[Link](https://dl.acm.org/doi/10.1145/3534678.3539147),[Document](https://dx.doi.org/10.1145/3534678.3539147)Cited by:[§4\.1](https://arxiv.org/html/2606.25380#S4.SS1.p1.1)\. - C\. T\. Leong, Y\. Cheng, J\. Wang, J\. Wang, and W\. 
Li \(2023\)Self\-detoxifying language models via toxification reversal\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 4433–4449\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.269/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.269)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.5.4.4.1.1),[§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\. - C\. Li, M\. Chen, J\. Wang, S\. Sitaram, and X\. Xie \(2024a\)Culturellm: incorporating cultural differences into large language models\.Advances in Neural Information Processing Systems37,pp\. 84799–84838\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/9a16935bf54c4af233e25d998b7f4a2c-Paper-Conference.pdf)Cited by:[§6\.2](https://arxiv.org/html/2606.25380#S6.SS2.p1.1)\. - X\. Li, Z\. Yong, and S\. H\. Bach \(2024b\)Preference tuning for toxicity mitigation generalizes across languages\.arXiv preprintarXiv:2406\.16235\.Note:June 2024External Links:[Link](https://arxiv.org/abs/2406.16235)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.3.2.4.1.1),[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1),[§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p1.1)\. - A\. Liu, M\. Sap, X\. Lu, S\. Swayamdipta, C\. Bhagavatula, N\. A\. Smith, and Y\. Choi \(2021\)DExperts: decoding\-time controlled text generation with experts and anti\-experts\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics \(ACL 2021\),Online,pp\. 1990–2001\.Note:ACL 2021External Links:[Link](https://aclanthology.org/2021.acl-long.522/)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.4.3.4.1.1),[§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2),[§6\.4](https://arxiv.org/html/2606.25380#S6.SS4.p1.1)\. - H\. Liu, H\. Huang, X\. Gu, H\. Wang, and Y\. 
Wang \(2025\)On calibration of llm\-based guard models for reliable content moderation\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=wUbum0nd9N)Cited by:[Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.4.3.5.1.1),[§4\.4](https://arxiv.org/html/2606.25380#S4.SS4.p1.1)\. - V\. Logacheva, D\. Dementieva, S\. Ustyantsev, D\. Moskovskiy, D\. Dale, I\. Krotova, N\. Semenov, and A\. Panchenko \(2022\)ParaDetox: detoxification with parallel data\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 6804–6818\.External Links:[Link](https://aclanthology.org/2022.acl-long.469/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.469)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.2.1.4.1.1),[§1](https://arxiv.org/html/2606.25380#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p2.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.4.1.1)\. - H\. Lu, L\. Fang, R\. Zhang, X\. Li, J\. Cai, H\. Cheng, L\. Tang, Z\. Liu, Z\. Sun, T\. Wang, Y\. Zhang, A\. H\. Zidan, J\. Xu, J\. Yu, M\. Yu, H\. Jiang, X\. Gong, W\. Luo, B\. Sun, Y\. Chen, T\. Ma, S\. Wu, Y\. Zhou, J\. Chen, H\. Xiang, J\. Zhang, A\. Jahin, W\. Ruan, K\. Deng, Y\. Pan, P\. Wang, J\. Li, Z\. Liu, L\. Zhang, L\. Zhao, W\. Liu, D\. Zhu, X\. Xing, F\. Dou, W\. Zhang, C\. Huang, R\. Liu, M\. Zhang, Y\. Liu, X\. Sun, Q\. Lu, Z\. Xiang, W\. Zhong, T\. Liu, and P\. 
Ma \(2025\)Alignment and safety in large language models: safety mechanisms, training paradigms, and emerging challenges\.External Links:2507\.19672,[Link](https://arxiv.org/abs/2507.19672)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px3.p1.1),[§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1)\. - T\. Luong, T\. Le, L\. Ngo, and T\. Nguyen \(2024\)Realistic evaluation of toxicity in large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp\. 1038–1047\.External Links:[Link](https://aclanthology.org/2024.findings-acl.61/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.61)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p4.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.16.13.1)\. - T\. Mandl, S\. Modha, P\. Majumder, D\. Patel, M\. Dave, C\. Mandlia, and A\. Patel \(2019\)Overview of the hasoc track at fire 2019: hate speech and offensive content identification in indo\-european languages\.InProceedings of the 11th annual meeting of the Forum for Information Retrieval Evaluation,pp\. 14–17\.External Links:[Link](https://doi.org/10.1145/3368567.3368584),[Document](https://dx.doi.org/10.1145/3368567.3368584)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1)\. - K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in gpt\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 17359–17372\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html)Cited by:[§5\.4](https://arxiv.org/html/2606.25380#S5.SS4.p1.1)\. - T\. Meng, N\. Mehrabi, P\. Goyal, A\. Ramakrishna, A\. Galstyan, R\. Zemel, K\. Chang, R\. Gupta, and C\. 
Peris \(2024\)Attribute controlled fine\-tuning for large language models: a case study on detoxification\.Note:Amazon ScienceExternal Links:[Link](https://www.amazon.science/publications/attribute-controlled-fine-tuning-for-large-language-models-a-case-study-on-detoxification)Cited by:[§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px1.p1.1)\. - D\. Moskovskiy, D\. Dementieva, and A\. Panchenko \(2022\)Exploring cross\-lingual text detoxification with large multilingual language models\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop,Dublin, Ireland,pp\. 346–354\.External Links:[Link](https://aclanthology.org/2022.acl-srw.26/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-srw.26)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1)\. - D\. Moskovskiy, S\. Pletenev, and A\. Panchenko \(2024\)LLMs to replace crowdsourcing for parallel data creation? the case of text detoxification\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 14361–14373\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.839/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.839)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1)\. - D\. Moskovskiy, N\. Sushko, S\. Pletenev, E\. Tutubalina, and A\. Panchenko \(2025\)SynthDetoxM: modern llms are few\-shot parallel detoxification data annotators\.arXiv preprint arXiv:2502\.06394\.External Links:2502\.06394,[Link](https://arxiv.org/abs/2502.06394),[Document](https://dx.doi.org/10.48550/arXiv.2502.06394)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.6.3.1)\. - S\. Mukherjee, A\. Bansal, A\. Kr\. Ojha, J\. P\. McCrae, and O\. 
Dusek \(2023\)Text detoxification as style transfer in English and Hindi\.InProceedings of the 20th International Conference on Natural Language Processing \(ICON\),J\. D\. Pawar and S\. Lalitha Devi \(Eds\.\),Goa University, Goa, India,pp\. 133–144\.External Links:[Link](https://aclanthology.org/2023.icon-1.13/)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1)\. - V\. Neplenbroek, A\. Bisazza, and R\. Fernández \(2025\)Cross\-lingual transfer of debiasing and detoxification in multilingual llms: an extensive investigation\.InFindings of the Association for Computational Linguistics: ACL 2025,Vienna, Austria,pp\. 2805–2830\.External Links:[Link](https://aclanthology.org/2025.findings-acl.145/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.145)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.3.2.4.1.1),[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1),[§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px1.p1.1)\. - L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. F\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 27730–27744\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html)Cited by:[§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px3.p1.1),[§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1)\. - K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\)BLEU: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics,Philadelphia, Pennsylvania, USA,pp\. 
311–318\.External Links:[Link](https://aclanthology.org/P02-1040/),[Document](https://dx.doi.org/10.3115/1073083.1073135)Cited by:[§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p3.1)\. - D\. Pascual, B\. Egressy, C\. Meister, R\. Cotterell, and R\. Wattenhofer \(2021\)A plug\-and\-play method for controlled text generation\.InFindings of the Association for Computational Linguistics: EMNLP 2021,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Punta Cana, Dominican Republic,pp\. 3973–3997\.External Links:[Link](https://aclanthology.org/2021.findings-emnlp.334/),[Document](https://dx.doi.org/10.18653/v1/2021.findings-emnlp.334)Cited by:[§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\. - J\. Pavlopoulos, J\. Sorensen, L\. Laugier, and I\. Androutsopoulos \(2021\)SemEval\-2021 task 5: toxic spans detection\.InProceedings of the 15th International Workshop on Semantic Evaluation \(SemEval\-2021\),A\. Palmer, N\. Schneider, N\. Schluter, G\. Emerson, A\. Herbelot, and X\. Zhu \(Eds\.\),Online,pp\. 59–69\.External Links:[Link](https://aclanthology.org/2021.semeval-1.6/),[Document](https://dx.doi.org/10.18653/v1/2021.semeval-1.6)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1)\. - E\. Perez, S\. Huang, F\. Song, T\. Cai, R\. Ring, J\. Aslanides, A\. Glaese, N\. McAleese, and G\. Irving \(2022\)Red teaming language models with language models\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Abu Dhabi, United Arab Emirates,pp\. 3419–3448\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.225/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.225)Cited by:[§2\.2](https://arxiv.org/html/2606.25380#S2.SS2.p1.1)\. - S\. Poppi, Z\. Yong, Y\. He, B\. Chern, H\. Zhao, A\. Yang, and J\. Chi \(2025\)Towards understanding the fragility of multilingual llms against fine\-tuning attacks\.InFindings of the Association for Computational Linguistics: NAACL 2025,Albuquerque, New Mexico,pp\. 
2358–2372\.External Links:[Link](https://aclanthology.org/2025.findings-naacl.126/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-naacl.126)Cited by:[§2\.3](https://arxiv.org/html/2606.25380#S2.SS3.SSS0.Px1.p1.1)\. - L\. Pozzobon, B\. Ermis, P\. Lewis, and S\. Hooker \(2023\)Goodtriever: adaptive toxicity mitigation with retrieval\-augmented models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore,pp\. 5108–5125\.Note:EMNLP Findings 2023External Links:[Link](https://aclanthology.org/2023.findings-emnlp.339/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.339)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.5.4.4.1.1),[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1),[§5\.3](https://arxiv.org/html/2606.25380#S5.SS3.p1.2)\. - R\. Rei, C\. Stewart, A\. C\. Farinha, and A\. Lavie \(2020\)COMET: a neural framework for MT evaluation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Online,pp\. 2685–2702\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.213/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.213)Cited by:[§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p3.1)\. - P\. Röttger, H\. Seelawi, D\. Nozza, Z\. Talat, and B\. Vidgen \(2022\)Multilingual hatecheck: functional tests for multilingual hate speech detection models\.InProceedings of the Sixth Workshop on Online Abuse and Harms \(WOAH\),Seattle, Washington \(Hybrid\),pp\. 154–169\.External Links:[Link](https://aclanthology.org/2022.woah-1.15/),[Document](https://dx.doi.org/10.18653/v1/2022.woah-1.15)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.9.6.1)\. - P\. Röttger, B\. Vidgen, D\. Nguyen, Z\. Waseem, H\. Margetts, and J\. 
Pierrehumbert \(2021\)HateCheck: functional tests for hate speech detection models\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),Online,pp\. 41–58\.External Links:[Link](https://aclanthology.org/2021.acl-long.4/),[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.4)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1),[Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.9.6.1)\. - M\. Samvelyan, S\. C\. Raparthy, A\. Lupu, E\. Hambro, A\. H\. Markosyan, M\. Bhatt, Y\. Mao, M\. Jiang, J\. Parker\-Holder, and J\. Foerster \(2024\)Rainbow teaming: open\-ended generation of diverse adversarial prompts\.Advances in Neural Information Processing Systems37,pp\. 69747–69786\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/8147a43d030b43a01020774ae1d3e3bb-Abstract-Conference.html),[Document](https://dx.doi.org/10.52202/079017-2229)Cited by:[§2\.2](https://arxiv.org/html/2606.25380#S2.SS2.p1.1)\. - M\. Sap, D\. Card, S\. Gabriel, Y\. Choi, and N\. A\. Smith \(2019\)The risk of racial bias in hate speech detection\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 1668–1678\.External Links:[Link](https://aclanthology.org/P19-1163/),[Document](https://dx.doi.org/10.18653/v1/P19-1163)Cited by:[Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.2.1.5.1.1),[§6\.2](https://arxiv.org/html/2606.25380#S6.SS2.p1.1)\. - M\. Sap, S\. Swayamdipta, L\. Vianna, X\. Zhou, Y\. Choi, and N\. A\. 
Smith \(2022\)Annotators with attitudes: how annotator beliefs and identities bias toxic language detection\.InProceedings of the 2022 conference of the north american chapter of the association for computational linguistics: Human language technologies,Seattle, United States,pp\. 5884–5906\.External Links:[Link](https://aclanthology.org/2022.naacl-main.431/),[Document](https://dx.doi.org/10.18653/v1/2022.naacl-main.431)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1),[§6\.2](https://arxiv.org/html/2606.25380#S6.SS2.p1.1),[§6\.3](https://arxiv.org/html/2606.25380#S6.SS3.p1.1)\. - T\. Sellam, D\. Das, and A\. Parikh \(2020\)BLEURT: learning robust metrics for text generation\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,Online,pp\. 7881–7892\.External Links:[Link](https://aclanthology.org/2020.acl-main.704/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.704)Cited by:[§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p2.1)\. - Z\. H\. Shaik, A\. Mazhar, A\. Srivastava, and M\. S\. Akhtar \(2025\)Redefining experts: interpretable decomposition of language models for toxicity mitigation\.arXiv preprint arXiv:2509\.16660\.External Links:2509\.16660,[Link](https://arxiv.org/abs/2509.16660),[Document](https://dx.doi.org/10.48550/arXiv.2509.16660)Cited by:[Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.5.4.5.1.1),[§4\.3](https://arxiv.org/html/2606.25380#S4.SS3.p1.1)\. - A\. Sharma and R\. Bhalla \(2025\)Detecting hate speech for hindi\-english code\-mix text data using dual contrastive learning\.Procedia Computer Science259,pp\. 35–43\.External Links:[Link](https://www.sciencedirect.com/science/article/pii/S1877050925010488),[Document](https://dx.doi.org/10.1016/j.procs.2025.03.304)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.p1.1),[§6\.5](https://arxiv.org/html/2606.25380#S6.SS5.p1.1)\. - L\. Shen, W\. Tan, S\. Chen, Y\. Chen, J\. 
Zhang, H\. Xu, B\. Zheng, P\. Koehn, and D\. Khashabi \(2024\)The language barrier: dissecting safety challenges of llms in multilingual contexts\.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp\. 2668–2680\.External Links:[Link](https://aclanthology.org/2024.findings-acl.156/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.156)Cited by:[§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px2.p1.1),[§2\.1](https://arxiv.org/html/2606.25380#S2.SS1.SSS0.Px2.p1.1),[§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1),[§6\.2](https://arxiv.org/html/2606.25380#S6.SS2.p1.1)\. - A\. Singhania, C\. Dupuy, S\. S\. Mangale, and A\. Namboori \(2025\)Multi\-lingual multi\-turn automated red teaming for llms\.InProceedings of the 5th Workshop on Trustworthy NLP \(TrustNLP 2025\),Albuquerque, New Mexico,pp\. 141–154\.External Links:[Link](https://aclanthology.org/2025.trustnlp-main.11/),[Document](https://dx.doi.org/10.18653/v1/2025.trustnlp-main.11)Cited by:[§2\.2](https://arxiv.org/html/2606.25380#S2.SS2.p1.1)\. - A\. Som, K\. Sikka, H\. Gent, A\. Divakaran, A\. Kathol, and D\. Vergyri \(2024\)Demonstrations are all you need: advancing offensive content paraphrasing using in\-context learning\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 12612–12627\.External Links:[Link](https://aclanthology.org/2024.findings-acl.749/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.749)Cited by:[§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p2.1)\. - M\. A\. Stranisci and C\. Hardmeier \(2025\)What are they filtering out? a survey of filtering strategies for harm reduction in pretraining datasets\.arXiv preprint arXiv:2503\.05721\.External Links:2503\.05721,[Link](https://arxiv.org/abs/2503.05721),[Document](https://dx.doi.org/10.48550/arXiv.2503.05721)Cited by:[§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1)\. - A\. 
Sundar, S\. Williamson, K\. Metcalf, B\. Theobald, S\. Seto, and M\. Fedzechkina \(2025\)Steering into new embedding spaces: analyzing cross\-lingual alignment induced by model interventions in multilingual language models\.arXiv preprint arXiv:2502\.15639\.External Links:2502\.15639,[Link](https://arxiv.org/abs/2502.15639),[Document](https://dx.doi.org/10.48550/arXiv.2502.15639)Cited by:[§5\.4](https://arxiv.org/html/2606.25380#S5.SS4.p1.1)\. - T\. Tiţa and A\. Zubiaga \(2021\)Cross\-lingual hate speech detection using transformer models\.arXiv preprint arXiv:2111\.00981\.External Links:2111\.00981,[Link](https://arxiv.org/abs/2111.00981),[Document](https://dx.doi.org/10.48550/arXiv.2111.00981)Cited by:[Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.2.1.5.1.1),[§1](https://arxiv.org/html/2606.25380#S1.p1.1),[§4\.1](https://arxiv.org/html/2606.25380#S4.SS1.p1.1)\. - A\. M\. Turner, L\. Thiergart, G\. Leech, D\. Udell, J\. J\. Vazquez, U\. Mini, and M\. MacDiarmid \(2024\)Steering language models with activation engineering\.External Links:2308\.10248,[Link](https://arxiv.org/abs/2308.10248)Cited by:[Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.6.5.4.1.1),[§5\.4](https://arxiv.org/html/2606.25380#S5.SS4.p1.1),[§6\.4](https://arxiv.org/html/2606.25380#S6.SS4.p1.1)\. - B\. Upadhayay and V\. Behzadan \(2024\)Sandwich attack: multi\-language mixture adaptive attack on llms\.InProceedings of the 4th Workshop on Trustworthy Natural Language Processing \(TrustNLP 2024\),Mexico City, Mexico,pp\. 208–226\.External Links:[Link](https://aclanthology.org/2024.trustnlp-1.18/),[Document](https://dx.doi.org/10.18653/v1/2024.trustnlp-1.18)Cited by:[§2\.1](https://arxiv.org/html/2606.25380#S2.SS1.SSS0.Px3.p1.1)\. - B\. Upadhayay and V\. Behzadan \(2025\)Tongue\-tied: breaking LLMs safety through new language learning\.InProceedings of the 7th Workshop on Computational Approaches to Linguistic Code\-Switching,G\. I\. Winata, S\. Kar, M\. Zhukova, T\. Solorio, X\. 
Ai, I\. Hamed, M\. K\. K\. Ihsani, D\. T\. Wijaya, and G\. Kuwanto \(Eds\.\), Albuquerque, New Mexico, USA, pp\. 32–47\. External Links: [Link](https://aclanthology.org/2025.calcs-1.5/), [Document](https://dx.doi.org/10.18653/v1/2025.calcs-1.5), ISBN 979\-8\-89176\-053\-0 Cited by: [§2\.3](https://arxiv.org/html/2606.25380#S2.SS3.SSS0.Px2.p1.1)\.
- S\. Verma, K\. Hines, J\. Bilmes, C\. Siska, L\. Zettlemoyer, H\. Gonen, and C\. Singh \(2025\) MULTIGUARD: an efficient approach for AI safety moderation across languages and modalities\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, pp\. 16173–16187\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.819/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.819) Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.7.6.4.1.1), [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- P\. Vongpradit, A\. Imsombut, S\. Kongyoung, C\. Damrongrat, S\. Phaholphinyo, and T\. Tanawong \(2024\) SafeCultural: a dataset for evaluating safety and cultural sensitivity in large language models\. In 2024 8th International Conference on Information Technology \(InCIT\), pp\. 740–745\. External Links: [Document](https://dx.doi.org/10.1109/InCIT63192.2024.10810548) Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1)\.
- A\. Wang, M\. Sudhakar, and Y\. Ji \(2021\) Simple text detoxification by identifying a linear toxic subspace in language model embeddings\. arXiv preprint arXiv:2112\.08346\. External Links: 2112\.08346, [Link](https://arxiv.org/abs/2112.08346), [Document](https://dx.doi.org/10.48550/arXiv.2112.08346) Cited by: [Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.5.4.5.1.1), [§4\.3](https://arxiv.org/html/2606.25380#S4.SS3.p1.1)\.
- B\. Wang, W\. Ping, C\. Xiao, P\. Xu, M\. Patwary, M\. Shoeybi, B\. Li, A\. Anandkumar, and B\. Catanzaro \(2022\) Exploring the limits of domain\-adaptive training for detoxifying large\-scale language models\. Advances in Neural Information Processing Systems 35, pp\. 35811–35824\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/e8c20cafe841cba3e31a17488dc9c3f1-Abstract-Conference.html) Cited by: [§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px1.p1.1)\.
- M\. Wang, N\. Zhang, Z\. Xu, Z\. Xi, S\. Deng, Y\. Yao, Q\. Zhang, L\. Yang, J\. Wang, and H\. Chen \(2024a\) Detoxifying large language models via knowledge editing\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), Bangkok, Thailand, pp\. 3093–3118\. External Links: [Link](https://aclanthology.org/2024.acl-long.171/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.171) Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.6.5.4.1.1), [§4\.3](https://arxiv.org/html/2606.25380#S4.SS3.p1.1), [§5\.4](https://arxiv.org/html/2606.25380#S5.SS4.p1.1)\.
- W\. Wang, Z\. Tu, C\. Chen, Y\. Yuan, J\. Huang, W\. Jiao, and M\. Lyu \(2024b\) All languages matter: on the multilingual safety of LLMs\. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, pp\. 5865–5877\. External Links: [Link](https://aclanthology.org/2024.findings-acl.349/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.349) Cited by: [§6\.1](https://arxiv.org/html/2606.25380#S6.SS1.p1.1), [§6\.3](https://arxiv.org/html/2606.25380#S6.SS3.p2.1)\.
- J\. Welbl, A\. Glaese, J\. Uesato, S\. Dathathri, J\. Mellor, L\. A\. Hendricks, K\. Anderson, P\. Kohli, B\. Coppin, and P\. Huang \(2021\) Challenges in detoxifying language models\. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, pp\. 2447–2469\. External Links: [Link](https://aclanthology.org/2021.findings-emnlp.210/), [Document](https://dx.doi.org/10.18653/v1/2021.findings-emnlp.210) Cited by: [§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1), [§5\.2](https://arxiv.org/html/2606.25380#S5.SS2.SSS0.Px1.p1.1), [§6\.4](https://arxiv.org/html/2606.25380#S6.SS4.p1.1)\.
- J\. Wen, P\. Ke, H\. Sun, Z\. Zhang, C\. Li, J\. Bai, and M\. Huang \(2023\) Unveiling the implicit toxicity in large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp\. 1322–1338\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.84/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.84) Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1)\.
- A\. Xu, E\. Pathak, E\. Wallace, S\. Gururangan, M\. Sap, and D\. Klein \(2021\) Detoxifying language models risks marginalizing minority voices\. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp\. 2390–2397\. External Links: [Link](https://aclanthology.org/2021.naacl-main.190/), [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.190) Cited by: [§5\.1](https://arxiv.org/html/2606.25380#S5.SS1.p1.1), [§6\.4](https://arxiv.org/html/2606.25380#S6.SS4.p1.1)\.
- Y\. Yang, S\. Dan, S\. Li, D\. Roth, and I\. Lee \(2025\) MrGuard: a multilingual reasoning guardrail for universal LLM safety\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\), Suzhou, China, pp\. 27377–27396\. External Links: [Link](https://aclanthology.org/2025.emnlp-main.1392/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1392), ISBN 979\-8\-89176\-332\-6 Cited by: [Table 2](https://arxiv.org/html/2606.25380#A1.T2.1.1.4.3.5.1.1), [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.7.6.4.1.1), [§4\.4](https://arxiv.org/html/2606.25380#S4.SS4.p1.1), [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- D\. Yi, R\. Mu, G\. Jin, Y\. Qi, J\. Hu, X\. Zhao, J\. Meng, W\. Ruan, and X\. Huang \(2024\) Position: building guardrails for large language models requires systematic design\. In Forty\-first International Conference on Machine Learning, pp\. 11375–11394\. External Links: [Link](https://proceedings.mlr.press/v235/dong24c.html) Cited by: [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- H\. Yoo, Y\. Yang, and H\. Lee \(2025\) Code\-switching red\-teaming: LLM evaluation for safety and multilingual understanding\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Vienna, Austria, pp\. 13392–13413\. External Links: [Link](https://aclanthology.org/2025.acl-long.657/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.657) Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1), [§2\.1](https://arxiv.org/html/2606.25380#S2.SS1.SSS0.Px3.p1.1), [§2\.2](https://arxiv.org/html/2606.25380#S2.SS2.p1.1), [§6\.5](https://arxiv.org/html/2606.25380#S6.SS5.p1.1)\.
- M\. Zampieri, S\. Malmasi, P\. Nakov, S\. Rosenthal, N\. Farra, and R\. Kumar \(2019\) SemEval\-2019 task 6: identifying and categorizing offensive language in social media \(OffensEval\)\. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp\. 75–86\. External Links: [Link](https://aclanthology.org/S19-2010/), [Document](https://dx.doi.org/10.18653/v1/S19-2010) Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.8.5.1)\.
- M\. Zampieri, P\. Nakov, S\. Rosenthal, P\. Atanasova, G\. Karadzhov, H\. Mubarak, L\. Derczynski, Z\. Pitenis, and Ç\. Çöltekin \(2020\) SemEval\-2020 task 12: multilingual offensive language identification in social media \(OffensEval 2020\)\. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, A\. Herbelot, X\. Zhu, A\. Palmer, N\. Schneider, J\. May, and E\. Shutova \(Eds\.\), Barcelona \(online\), pp\. 1425–1447\. External Links: [Link](https://aclanthology.org/2020.semeval-1.188/), [Document](https://dx.doi.org/10.18653/v1/2020.semeval-1.188) Cited by: [§3\.1](https://arxiv.org/html/2606.25380#S3.SS1.p3.1), [Table 1](https://arxiv.org/html/2606.25380#S3.T1.2.2.2.2.2.2.2.8.5.1), [§4\.2](https://arxiv.org/html/2606.25380#S4.SS2.p1.1)\.
- R\. Zhang, S\. Cahyawijaya, J\. C\. B\. Cruz, G\. I\. Winata, and A\. F\. Aji \(2023\) Multilingual large language models are not \(yet\) code\-switchers\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp\. 12567–12582\. External Links: [Link](https://aclanthology.org/2023.emnlp-main.774/), [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.774) Cited by: [§1](https://arxiv.org/html/2606.25380#S1.SS0.SSS0.Px1.p1.1), [§6\.5](https://arxiv.org/html/2606.25380#S6.SS5.p1.1)\.
- T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2020\) BERTScore: evaluating text generation with BERT\. In International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=SkeHuCVFDr) Cited by: [§3\.2](https://arxiv.org/html/2606.25380#S3.SS2.p2.1)\.
- Z\. Zhang, Y\. Guo, J\. Lin, S\. Quan, H\. Zhang, and D\. Zhao \(2025\) English as defense proxy: mitigating multilingual jailbreak via eliciting English safety knowledge\. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, pp\. 1185–1196\. External Links: [Link](https://aclanthology.org/2025.findings-emnlp.62/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-emnlp.62) Cited by: [§2\.1](https://arxiv.org/html/2606.25380#S2.SS1.SSS0.Px2.p1.1)\.
- H\. Zhao, C\. Yuan, F\. Huang, X\. Hu, Y\. Zhang, A\. Yang, B\. Yu, D\. Liu, J\. Zhou, J\. Lin, B\. Yang, C\. Cheng, J\. Tang, J\. Jiang, J\. Zhang, J\. Xu, M\. Yan, M\. Sun, P\. Zhang, P\. Xie, Q\. Tang, Q\. Zhu, R\. Zhang, S\. Wu, S\. Zhang, T\. He, T\. Tang, T\. Xia, W\. Liao, W\. Shen, W\. Yin, W\. Zhou, W\. Yu, X\. Wang, X\. Deng, X\. Xu, X\. Zhang, Y\. Liu, Y\. Li, Y\. Zhang, Y\. Jiang, Y\. Wan, and Y\. Zhou \(2025\) Qwen3Guard technical report\. arXiv preprint arXiv:2510\.14276\. External Links: 2510\.14276, [Link](https://arxiv.org/abs/2510.14276), [Document](https://dx.doi.org/10.48550/arXiv.2510.14276) Cited by: [Table 3](https://arxiv.org/html/2606.25380#A1.T3.1.1.7.6.4.1.1), [§5\.5](https://arxiv.org/html/2606.25380#S5.SS5.p1.1)\.
- T\. Y\. Zhuo, Y\. Huang, C\. Chen, and Z\. Xing \(2023\) Red teaming ChatGPT via jailbreaking: bias, robustness, reliability and toxicity\. arXiv preprint arXiv:2301\.12867\. External Links: 2301\.12867, [Link](https://arxiv.org/abs/2301.12867), [Document](https://dx.doi.org/10.48550/arXiv.2301.12867) Cited by: [§2\.2](https://arxiv.org/html/2606.25380#S2.SS2.p1.1)\.

## Appendix A Detection and Detoxification Comparisons

Table 2: Comparison of multilingual toxicity detection approaches\.

Table 3: Comparison of multilingual LLM detoxification and moderation techniques\.