Lessons learned on language model safety and misuse

OpenAI Blog 03/03/22, 08:00 AM News

language-models ai-safety content-moderation evaluation-metrics deployment responsible-ai

Summary

OpenAI shares lessons learned on language model safety and misuse, discussing challenges in measuring risks, the limitations of existing benchmarks, and their development of new evaluation metrics for toxicity and policy violations. The post also highlights concerns about labor market impacts and the need for continued research on measuring social effects of AI deployment at scale.

We describe our latest thinking in the hope of helping other AI developers address safety and misuse of deployed models.

Cached at: 04/20/26, 02:46 PM

# Lessons learned on language model safety and misuse

Source: [https://openai.com/index/language-model-safety-and-misuse/](https://openai.com/index/language-model-safety-and-misuse/)

Many aspects of language models’ risks and impacts remain hard to measure and therefore hard to monitor, minimize, and disclose in an accountable way. We have made active use of existing academic benchmarks for language model evaluation and are eager to continue building on external work, but we have also found that existing benchmark datasets often do not reflect the safety and misuse risks we see in practice.[E](https://openai.com/index/language-model-safety-and-misuse/#citation-bottom-E)

Such limitations reflect the fact that academic datasets are seldom created for the explicit purpose of informing production use of language models, and do not benefit from the experience gained from deploying such models at scale. As a result, we’ve been developing new evaluation datasets and frameworks for measuring the safety of our models, which we plan to release soon. Specifically, we have developed new evaluation metrics for measuring toxicity in model outputs and have also developed in-house classifiers for detecting content that violates our [content policy](https://beta.openai.com/docs/usage-guidelines/content-guidelines), such as erotic content, hate speech, violence, harassment, and self-harm. Both of these have in turn been used to improve our pre-training data[F](https://openai.com/index/language-model-safety-and-misuse/#citation-bottom-F): the classifiers filter out content, and the evaluation metrics measure the effects of dataset interventions.

Reliably classifying individual model outputs along various dimensions is difficult, and measuring their social impact at the scale of the OpenAI API is even harder.
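
The data intervention described above (a classifier filters content, and an evaluation metric measures the effect) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the keyword-based `toy_policy_score` stands in for a trained classifier, the `THRESHOLD` value is arbitrary, and the metric is computed on the data itself rather than on the outputs of a model trained on that data. It does not represent OpenAI's actual classifiers or metrics, which the post does not specify.

```python
from typing import Callable, List

# Hypothetical placeholder vocabulary; a real classifier would be a trained model.
FLAGGED_TERMS = {"badword", "slur"}


def toy_policy_score(text: str) -> float:
    """Toy score in [0, 1]: fraction of whitespace-separated tokens that are flagged."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in FLAGGED_TERMS for t in tokens) / len(tokens)


def filter_corpus(docs: List[str], score_fn: Callable[[str], float],
                  threshold: float) -> List[str]:
    """Dataset intervention: keep documents scoring strictly below the threshold."""
    return [d for d in docs if score_fn(d) < threshold]


def mean_score(docs: List[str], score_fn: Callable[[str], float]) -> float:
    """Evaluation metric: average score over a set of documents."""
    if not docs:
        return 0.0
    return sum(score_fn(d) for d in docs) / len(docs)


corpus = [
    "a neutral sentence about weather",
    "badword badword in a short text",
    "a long text with one badword in it somewhere",
    "one slur here",
]
THRESHOLD = 0.2  # assumed cutoff, chosen only for this example

before = mean_score(corpus, toy_policy_score)
filtered = filter_corpus(corpus, toy_policy_score, THRESHOLD)
after = mean_score(filtered, toy_policy_score)

print(f"kept {len(filtered)} of {len(corpus)} documents")
print(f"mean score before: {before:.3f}, after: {after:.3f}")
```

The third document stays in the filtered set because its score is below the cutoff, so the metric drops without reaching zero. The threshold therefore trades how much content is removed against how much data is retained.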
We have conducted several internal studies in order to build an institutional muscle for such measurement, but these have often raised more questions than answers.

We are particularly interested in better understanding the economic impact of our models and the distribution of those impacts. We have good reason to believe that the labor market impacts from the deployment of current models may already be significant in absolute terms, and that they will grow as the capabilities and reach of our models grow. We have learned of a variety of local effects to date, including massive productivity improvements on existing tasks performed by individuals, such as copywriting and summarization (sometimes contributing to job displacement and creation), as well as cases where the API unlocked new applications that were previously infeasible, such as [synthesis of large-scale qualitative feedback](https://openai.com/index/gpt-3-apps/). But we lack a good understanding of the net effects.

We believe that it is important for those developing and deploying powerful AI technologies to address both the positive and negative effects of their work head-on. We discuss some steps in that direction in the concluding section of this post.

Each of the lessons above raises new questions of its own. What kinds of safety incidents might we still be failing to detect and anticipate? How can we better measure risks and impacts? How can we continue to improve both the safety and utility of our models, and navigate tradeoffs between the two when they arise? We are actively discussing many of these issues with other companies deploying language models. But we also know that no organization or set of organizations has all the answers, and we would like to highlight several ways that readers can get more involved in understanding and shaping our deployment of state-of-the-art AI systems.
First, gaining first-hand experience interacting with state-of-the-art AI systems is invaluable for understanding their capabilities and implications. We recently ended the API waitlist after building more confidence in our ability to effectively detect and respond to misuse. Individuals in [supported countries and territories](https://beta.openai.com/docs/supported-countries) can quickly get access to the OpenAI API by signing up [here](https://openai.com/api/).

Second, researchers working on topics of particular interest to us, such as bias and misuse, and who would benefit from financial support, can apply for subsidized API credits using [this form](https://share.hsforms.com/1b-BEAq_qQpKcfFGKwwuhxA4sk30). External research is vital for informing both our understanding of these multifaceted systems and wider public understanding.

Finally, today we are publishing a [research agenda](https://openai.com/index/economic-impacts/) exploring the labor market impacts associated with our Codex family of models, and a call for external collaborators on carrying out this research. We are excited to work with independent researchers to study the effects of our technologies in order to inform appropriate policy interventions, and to eventually expand our thinking from code generation to other modalities.
