Tag
This paper introduces a comprehensive hate speech dataset for Turkish and Arabic and develops state-of-the-art BERT-based models for hate speech analysis, including classification, intensity prediction, target identification, and span detection.
This paper finds that 42.6% of annotator disagreement in HateXplain concentrates at the hate/offensive boundary, demonstrating that majority vote silences minority values and leads to models being wrong but highly confident on contested inputs.
This paper studies hate speech cascades on Bluesky and uses multi-LLM agents to simulate them, finding that such simulations reproduce key patterns like stance monoculture and toxicity-delta direction, and that amplifier targeting on dense networks yields a 7.5–12.9% reduction in hateful content with low benign collateral.
A new report from the Center for Countering Digital Hate (CCDH) reveals that racist comments targeting politicians tripled after Meta relaxed its content moderation rules, with violent threats and hate speech quadrupling and bullying doubling.
New research from the Center for Countering Digital Hate shows that abusive comments, violent threats, and hate speech against US lawmakers on Facebook tripled or quadrupled in the six months after Meta relaxed its speech rules in early 2025.
Elon Musk highlights Grok's response to a user who copied Gemini's analysis of a Belgian hate speech conviction and asked Grok to reply.
This paper studies the use of large language models to assist expert counterspeech writing when hate speech and misinformation co-occur, testing knowledge-driven strategies with human evaluation. The mixed strategy combining fact-checkers' and NGOs' guidelines proved most effective.
This replication study evaluates DExperts for mitigating toxicity in LLMs, finding near-perfect safety against explicit toxicity but reduced effectiveness against implicit hate speech and a significant latency trade-off.
Researchers from UCLA examine how automated content moderation tools, including Perspective API, fail to distinguish between reclaimed and hateful uses of slurs aimed at LGBTQIA+, Black, and women's communities. The study finds low inter-annotator agreement even among in-group members and poor alignment between community judgments and AI moderation tools, highlighting the need for context-sensitive approaches.