OpenAI, Berkeley, and Stanford researchers co-authored a foundational paper identifying five concrete safety problems in modern AI systems: safe exploration, robustness to distributional shift, avoiding negative side effects, preventing reward hacking, and scalable oversight.
We (along with researchers from Berkeley and Stanford) are co-authors on today’s paper led by Google Brain researchers, Concrete Problems in AI Safety. The paper explores many research problems around ensuring that modern machine learning systems operate as intended.
# Concrete AI safety problems
Source: [https://openai.com/index/concrete-ai-safety-problems/](https://openai.com/index/concrete-ai-safety-problems/)
OpenAIWe \(along with researchers from Berkeley and Stanford\) are co\-authors on today’s paper led by Google Brain researchers,[Concrete Problems in AI Safety\(opens in a new window\)](https://arxiv.org/abs/1606.06565)\. The paper explores many research problems around ensuring that modern machine learning systems operate as intended\.
The authors discuss five areas:
- **Safe exploration**\.*Can*[*reinforcement learning*\(opens in a new window\)](http://karpathy.github.io/2016/05/31/rl/)*\(RL\) agents learn about their environment without executing catastrophic actions?*For example, can an RL agent learn to navigate an environment without ever falling off a ledge?
- **Robustness to distributional shift**\.*Can machine learning systems be robust to changes in the data distribution, or at least fail gracefully?*For example, can we build[image classifiers\(opens in a new window\)](https://www.tensorflow.org/versions/r0.9/tutorials/deep_cnn/index.html)that indicate appropriate uncertainty when shown new kinds of images, instead of confidently trying to use its[potentially inapplicable\(opens in a new window\)](http://arxiv.org/abs/1412.6572)learned model?
- **Avoiding negative side effects**\.*Can we transform an RL agent’s*[*reward function*\(opens in a new window\)](https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node9.html)*to avoid undesired effects on the environment?*For example, can we build a robot that will move an object while avoiding knocking anything over or breaking anything, without manually programming a separate penalty for each possible bad behavior?
- **Avoiding “reward hacking” and “**[**wireheading**\(opens in a new window\)](http://www.agroparistech.fr/mmip/maths/laurent_orseau/papers/ring-orseau-AGI-2011-delusion.pdf)**”**\.*Can we prevent agents from “gaming” their reward functions, such as by distorting their observations?*For example, can we train an RL agent to minimize the number of dirty surfaces in a building, without causing it to avoid looking for dirty surfaces or to create new dirty surfaces to clean up?
- **Scalable oversight**\.*Can RL agents efficiently achieve goals for which feedback is very expensive?*For example, can we build an agent that tries to clean a room in the way the user would be happiest with, even though feedback from the user is very rare and we have to use cheap approximations \(like the presence of visible dirt\) during training? The divergence between cheap approximations and what we actually care about is an important source of accident risk\.
Many of the problems are not new, but the paper explores them in the context of cutting\-edge systems\. We hope they’ll inspire more people to work on AI safety research, whether[at OpenAI](https://openai.com/careers/)or elsewhere\.
We’re particularly excited to have participated in this paper as a cross\-institutional collaboration\. We think that broad AI safety collaborations will enable everyone to build better machine learning systems\.[Let us know\(opens in a new window\)](https://gitter.im/openai/research)if you have a future paper you’d like to collaborate on\!
OpenAI outlines 10 safety practices it actively uses and improves upon, including empirical red-teaming, alignment research, abuse monitoring, and voluntary commitments shared at the AI Seoul Summit. The company emphasizes a balanced, scientific approach to safety integrated into development from the outset.
OpenAI publishes a policy research paper identifying four strategies to improve industry cooperation on AI safety norms: communicating risks/benefits, technical collaboration, increased transparency, and incentivizing standards. The analysis addresses how competitive pressures could lead to under-investment in safety and proposes mechanisms to align incentives toward safe AI development.
OpenAI outlines its comprehensive approach to AI safety, emphasizing rigorous testing, iterative deployment, real-world monitoring, and regulatory engagement to ensure powerful AI systems are built and used safely.
OpenAI argues that AI safety research on value alignment requires social scientists to help address how human cognitive biases and inconsistencies affect the data used to train AI systems. The organization proposes human-only experiments as a method to uncover alignment problems before deploying machine learning solutions.
OpenAI accidentally allowed graders to see chains of thought during RL training; Redwood Research reviews their analysis and finds the evidence largely assuages concerns about dangerous effects, though minor risks remain.