AI safety testing is getting weird: when does benchmarking become abuse?
Summary
Reports indicate that Meta contractors posed as teenagers to test rival chatbots on sensitive topics like self-harm, sex, drugs, and eating disorders, raising ethical questions about AI safety benchmarking.
Similar Articles
Meta Contractors Posed as Teens to Prompt Rival Chatbots About Suicide, Sex, and Drugs
Meta hired contractors through Covalen to pose as teenagers and send high-risk prompts (suicide, sex, drugs) to rival chatbots including ChatGPT, Gemini, and Character.AI, as part of a safety benchmarking project called Cannes. Over 45,000 prompts were used in August 2025 alone, with the targeted companies unaware of the testing.
The other half of AI safety
The article critiques the AI safety field's focus on catastrophic risks at the expense of everyday mental health harms from chatbots like ChatGPT. It cites OpenAI's own data on millions of users who show signs of psychosis, mania, or suicidal ideation yet receive only redirects rather than hard gating.
AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
AICompanionBench introduces the first publicly available benchmark dataset of 2,123 real-world AI companion conversations annotated across nine safety risk categories, used to evaluate 20 LLMs as safety judges. Results show that strong models handle explicit harmful content well but struggle with nuanced risks like manipulation and produce false positives on benign conversations.
Helping developers build safer AI experiences for teens
OpenAI releases prompt-based safety policies and the open-weight gpt-oss-safeguard model to help developers build age-appropriate AI experiences for teens, covering risks like graphic content, harmful behaviors, and dangerous activities.
Your AI Agent is one bad prompt away from ruining your brand (And why traditional QA is useless)
The article argues that traditional chatbot QA is broken because it tests only happy paths, and proposes an AI-powered user simulator that attacks the bot with diverse personas and edge cases to find vulnerabilities before deployment.