Confidence-Building Measures for Artificial Intelligence: Workshop proceedings
Summary
OpenAI and UC Berkeley's workshop on Confidence-Building Measures for Artificial Intelligence brought together stakeholders to develop strategies for mitigating geopolitical risks from foundation models. It identified six key confidence-building measures (CBMs): crisis hotlines, incident sharing, model transparency, content provenance, red teaming, and dataset sharing.
Similar Articles
Concrete AI safety problems
OpenAI, Berkeley, and Stanford researchers co-authored a foundational paper identifying five concrete safety problems in modern AI systems: safe exploration, robustness to distributional shift, avoiding negative side effects, preventing reward hacking, and scalable oversight.
Preparing for malicious uses of AI
OpenAI co-authors a comprehensive paper forecasting malicious uses of AI and proposing mitigation strategies, developed in collaboration with leading research institutions. The work emphasizes acknowledging AI's dual-use nature, learning from cybersecurity practices, and broadening stakeholder discussions around AI security risks.
OpenAI Built Intelligence. Who Will Build Trust?
AutoFlow discusses the critical challenge of trust in AI, proposing external verification methods such as knowledge graphs and mathematical consistency checks, and announces acceptance into the NVIDIA Inception Program to advance research into trustworthy AI systems.
OpenAI safety practices
OpenAI outlines 10 safety practices it actively uses and improves upon, including empirical red-teaming, alignment research, abuse monitoring, and voluntary commitments shared at the AI Seoul Summit. The company emphasizes a balanced, scientific approach to safety integrated into development from the outset.
AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
AICompanionBench introduces the first publicly available benchmark dataset of 2,123 real-world AI companion conversations annotated across nine safety risk categories, used to evaluate 20 LLMs as safety judges. Results show that strong models handle explicit harmful content well but struggle with nuanced risks such as manipulation, and they produce false positives on benign conversations.