This replication study evaluates DExperts for mitigating toxicity in LLMs, finding that it nearly eliminates explicit toxicity but is less effective against implicit hate speech and incurs a significant decoding-latency overhead.
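For context, DExperts (Liu et al., 2021) steers generation at decoding time by combining a base LM's next-token logits with the difference between an "expert" (fine-tuned on non-toxic text) and an "anti-expert" (fine-tuned on toxic text). The sketch below illustrates that ensemble; the model names and the `ALPHA` value are placeholder assumptions, not the study's settings, and the three forward passes per token make the latency trade-off concrete.

```python
# Minimal sketch of DExperts-style decoding, assuming three compatible
# causal LMs that share a tokenizer. Model choices and ALPHA are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2-large"   # base LM (placeholder)
EXPERT = "gpt2"       # stand-in for an expert fine-tuned on non-toxic text
ANTI_EXPERT = "gpt2"  # stand-in for an anti-expert fine-tuned on toxic text
ALPHA = 2.0           # steering strength; a tunable hyperparameter

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
expert = AutoModelForCausalLM.from_pretrained(EXPERT).eval()
anti = AutoModelForCausalLM.from_pretrained(ANTI_EXPERT).eval()

@torch.no_grad()
def dexperts_generate(prompt: str, max_new_tokens: int = 30) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Three forward passes per step -- the source of the latency overhead.
        z_base = base(ids).logits[:, -1, :]
        z_exp = expert(ids).logits[:, -1, :]
        z_anti = anti(ids).logits[:, -1, :]
        # DExperts ensemble: shift probability mass toward the expert
        # and away from the anti-expert.
        logits = z_base + ALPHA * (z_exp - z_anti)
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy for simplicity
        ids = torch.cat([ids, next_id], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(dexperts_generate("The protesters said"))
```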
Researchers from UCLA examine how automated content moderation tools, including Perspective API, fail to distinguish between reclaimed and hateful uses of slurs targeting LGBTQIA+, Black, and women communities. The study finds low inter-annotator agreement even among in-group members and poor alignment between community members' judgments and AI moderation tools, highlighting the need for context-sensitive approaches.
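For readers unfamiliar with the tool being audited, the sketch below shows how a single text is scored via Perspective API's public `comments:analyze` endpoint. The request shape follows Google's published docs; `API_KEY` is a placeholder you must supply. The study's core concern is visible in the interface itself: the API sees only the text, with no speaker or community context, so a reclaimed in-group use and a hateful use of the same slur can receive similar scores.

```python
# Minimal sketch of querying Perspective API for a toxicity score.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str) -> float:
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, json=payload, timeout=10)
    resp.raise_for_status()
    # Summary score in [0, 1]; higher means more likely toxic.
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```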