A hazard analysis framework for code synthesis large language models

OpenAI Blog · Papers

Summary

OpenAI presents a hazard analysis framework for uncovering the safety risks that code synthesis LLMs like Codex may pose technically, socially, politically, and economically. The analysis is informed by a novel methodology for evaluating code generation capabilities against the complexity and expressivity of specification prompts.

# A hazard analysis framework for code synthesis large language models

Source: [https://openai.com/index/a-hazard-analysis-framework-for-code-synthesis-large-language-models/](https://openai.com/index/a-hazard-analysis-framework-for-code-synthesis-large-language-models/)

## Abstract

Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code. Although Codex provides a plethora of benefits, models that may generate code at such scale have significant limitations, alignment problems, the potential to be misused, and the possibility to increase the rate of progress in technical fields that may themselves have destabilizing impacts or misuse potential. Yet such safety impacts are not yet known or remain to be explored. In this paper, we outline a hazard analysis framework constructed at OpenAI to uncover hazards or safety risks that the deployment of models like Codex may impose technically, socially, politically, and economically. The analysis is informed by a novel evaluation framework that determines the capacity of advanced code generation techniques against the complexity and expressivity of specification prompts, and their capability to understand and execute them relative to human ability.

Similar Articles

Evaluating large language models trained on code

OpenAI Blog

OpenAI introduces Codex, a GPT model fine-tuned on GitHub code, achieving 28.8% functional correctness on HumanEval (a new benchmark for code synthesis from docstrings), significantly outperforming GPT-3 (0%) and GPT-J (11.4%). The paper demonstrates that repeated sampling improves performance to 70.2% with 100 samples, and discusses limitations and broader impacts of code generation systems.
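
The 70.2% figure reflects a pass@k-style evaluation: generate k samples per problem and count a problem as solved if any sample passes its unit tests. Below is a minimal sketch of the unbiased pass@k estimator described in that paper, using numpy; the function name and the example numbers are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for one problem.

    n: total number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k samples
        # must include at least one correct one.
        return 1.0
    # Product form of 1 - C(n - c, k) / C(n, k), numerically stable.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative example: 100 samples, 40 passing, estimate pass@10.
print(pass_at_k(n=100, c=40, k=10))
```

The product form avoids computing large binomial coefficients directly, which would overflow for realistic sample counts.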

Running Codex safely at OpenAI

OpenAI Blog

OpenAI details how it deploys Codex with safety controls including sandboxing, approval policies, network policies, and agent-native telemetry to ensure secure operation of coding agents in enterprise environments.
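
The post describes these controls rather than publishing an implementation. As a purely hypothetical illustration of the approval-policy idea, the sketch below gates commands issued by a coding agent before they execute; the command allowlist, function names, and policy behavior are assumptions made for this example, not OpenAI's actual configuration.

```python
import shlex
import subprocess

# Hypothetical allowlist of commands treated as read-only and safe to
# auto-approve; a real deployment would use a far more careful policy.
READ_ONLY_COMMANDS = {"ls", "cat", "grep", "head", "git"}

def approve(command: str) -> bool:
    """Return True if the command may run without a human in the loop."""
    parts = shlex.split(command)
    if not parts:
        return False
    if parts[0] in READ_ONLY_COMMANDS:
        return True  # auto-approved under the read-only policy
    # Everything else escalates to a human reviewer.
    answer = input(f"Agent wants to run: {command!r}. Allow? [y/N] ")
    return answer.strip().lower() == "y"

def run_agent_command(command: str) -> None:
    if not approve(command):
        print("Command denied by approval policy.")
        return
    # A real deployment would execute this inside a sandbox subject to a
    # network policy and record it via telemetry; here we simply run it.
    subprocess.run(shlex.split(command), check=False)

run_agent_command("ls -la")
```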

Lessons learned on language model safety and misuse

OpenAI Blog

OpenAI shares lessons learned on language model safety and misuse, discussing challenges in measuring risks, the limitations of existing benchmarks, and their development of new evaluation metrics for toxicity and policy violations. The post also highlights concerns about labor market impacts and the need for continued research on measuring social effects of AI deployment at scale.