This replication study evaluates DExperts for mitigating toxicity in LLMs, finding that it nearly eliminates explicit toxicity but is less effective against implicit hate speech and incurs a significant decoding-latency overhead.
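For context, DExperts (Liu et al., 2021) steers generation at decoding time by combining a base LM's next-token logits with the difference between an "expert" (fine-tuned on non-toxic text) and an "anti-expert" (fine-tuned on toxic text). The sketch below illustrates that ensemble; the model names and the `ALPHA` value are placeholder assumptions, not the study's settings, and the three forward passes per token make the latency trade-off concrete.

```python
# Minimal sketch of DExperts-style decoding, assuming three compatible
# causal LMs that share a tokenizer. Model choices and ALPHA are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2-large"   # base LM (placeholder)
EXPERT = "gpt2"       # stand-in for an expert fine-tuned on non-toxic text
ANTI_EXPERT = "gpt2"  # stand-in for an anti-expert fine-tuned on toxic text
ALPHA = 2.0           # steering strength; a tunable hyperparameter

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
expert = AutoModelForCausalLM.from_pretrained(EXPERT).eval()
anti = AutoModelForCausalLM.from_pretrained(ANTI_EXPERT).eval()

@torch.no_grad()
def dexperts_generate(prompt: str, max_new_tokens: int = 30) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Three forward passes per step -- the source of the latency overhead.
        z_base = base(ids).logits[:, -1, :]
        z_exp = expert(ids).logits[:, -1, :]
        z_anti = anti(ids).logits[:, -1, :]
        # DExperts ensemble: shift probability mass toward the expert
        # and away from the anti-expert.
        logits = z_base + ALPHA * (z_exp - z_anti)
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(dexperts_generate("The protesters said"))
```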
Researchers from UCLA examine how automated content moderation tools, including Perspective API, fail to distinguish between reclaimed and hateful uses of slurs targeting LGBTQIA+, Black, and women communities. The study finds low inter-annotator agreement even among in-group members and poor alignment between community members' judgments and AI moderation tools, highlighting the need for context-sensitive approaches.
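For readers unfamiliar with the tool being audited, the sketch below shows how a single text is scored via Perspective API's public `comments:analyze` endpoint. The request shape follows Google's published docs; `API_KEY` is a placeholder you must supply. The study's core concern is visible in the interface itself: the API sees only the text, with no speaker or community context, so a reclaimed in-group use and a hateful use of the same slur can receive similar scores.

```python
# Minimal sketch of querying Perspective API for a toxicity score.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=payload, timeout=10)
    resp.raise_for_status()
    # Summary score in [0, 1]; higher means more likely toxic.
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```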