This replication study evaluates DExperts for mitigating toxicity in LLMs, finding near-perfect safety against explicit toxicity, reduced effectiveness against implicit hate speech, and a significant inference-latency trade-off.
A Stanford study analyzing billions of social media posts reveals that only ~3% of users generate severely toxic content, but engagement-driven algorithms disproportionately amplify this minority, distorting public perception and driving self-censorship among the majority.