The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI Blog Papers

Summary

OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.

Today's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts.
Original Article Export to Word Export to PDF
View Cached Full Text

Cached at: 04/20/26, 02:47 PM

# The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions Source: [https://openai.com/index/the-instruction-hierarchy/](https://openai.com/index/the-instruction-hierarchy/) OpenAIToday's LLMs are susceptible to prompt injections, jailbreaks, and other attacks that allow adversaries to overwrite a model's original instructions with their own malicious prompts\. In this work, we argue that one of the primary vulnerabilities underlying these attacks is that LLMs often consider system prompts \(e\.g\., text from an application developer\) to be the same priority as text from untrusted users and third parties\. To address this, we propose an instruction hierarchy that explicitly defines how models should behave when instructions of different priorities conflict\. We then propose a data generation method to demonstrate this hierarchical instruction following behavior, which teaches LLMs to selectively ignore lower\-privileged instructions\. We apply this method to GPT‑3\.5, showing that it drastically increases robustness \-\- even for attack types not seen during training \-\- while imposing minimal degradations on standard capabilities\.

Similar Articles

Improving instruction hierarchy in frontier LLMs

OpenAI Blog

OpenAI presents a training approach using instruction-hierarchy tasks to improve LLM safety and reliability by teaching models to properly prioritize instructions based on trust levels (system > developer > user > tool). The method addresses prompt-injection attacks and safety steerability through reinforcement learning with a new dataset called IH-Challenge.

Understanding prompt injections: a frontier security challenge

OpenAI Blog

OpenAI publishes guidance on prompt injection attacks, a social engineering vulnerability where malicious instructions hidden in web content or documents can trick AI models into unintended actions. The company outlines its multi-layered defense strategy including instruction hierarchy research, automated red-teaming, and AI-powered monitoring systems.

Learning to reason with LLMs

OpenAI Blog

OpenAI publishes an article exploring reasoning techniques with LLMs through cipher-decoding examples, demonstrating step-by-step problem-solving approaches and pattern recognition in language models.

Aligning language models to follow instructions

OpenAI Blog

OpenAI introduces InstructGPT, a GPT-3 variant fine-tuned using reinforcement learning from human feedback (RLHF) to better follow instructions and reduce harmful outputs. A 1.3B InstructGPT model is preferred by human evaluators over a 175B GPT-3 model, now becoming the default on OpenAI's API.

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

arXiv cs.CL

This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.