# AI safety via debate
Source: [https://openai.com/index/debate/](https://openai.com/index/debate/)
One approach to aligning AI agents with human goals and preferences is to [ask humans at training time](https://openai.com/index/learning-from-human-preferences/) which behaviors are safe and useful. While promising, this method requires humans to recognize good or bad behavior; in many situations an agent's behavior may be too complex for a human to understand, or the task itself may be hard to judge or demonstrate. Examples include environments with very large, non-visual observation spaces: for instance, an agent that acts in a computer security-related environment, or an agent that coordinates a large set of industrial robots.
How can we augment humans so that they can effectively supervise advanced AI systems? One way is to take advantage of the AI itself to help with the supervision, asking the AI (or a separate AI) to point out flaws in any proposed action. To achieve this, we reframe the learning problem as a game played between two agents, where the agents have an argument with each other and the human judges the exchange. Even if the agents have a more advanced understanding of the problem than the human, the human may be able to judge which agent has the better argument (similar to expert witnesses arguing to convince a jury).
Our method proposes a specific debate format for such a game played between two dueling AI agents. The two agents can be trained by self-play, similar to [AlphaGo Zero](https://deepmind.com/blog/alphago-zero-learning-scratch) or [Dota 2](https://openai.com/index/dota-2/). Our hope is that, properly trained, such agents can produce value-aligned behavior far beyond the capabilities of the human judge. If the two agents disagree on the truth but the full reasoning is too large to show the humans, the debate can focus in on simpler and simpler factual disputes, eventually reaching a claim that is simple enough for direct judging.
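The debate format described above can be sketched as a simple turn-taking game: two agents alternate statements about a question, and a judge who sees only the transcript picks a winner. This is a minimal illustrative sketch, not OpenAI's actual training setup; the `Debate` class, the callable interfaces, and the toy debaters are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Debate:
    """Minimal sketch of a debate game: two agents alternate
    statements, then a judge picks a winner from the transcript.
    All interfaces here are illustrative assumptions."""
    question: str
    transcript: List[str] = field(default_factory=list)

    def run(self,
            alice: Callable[[str, List[str]], str],
            bob: Callable[[str, List[str]], str],
            judge: Callable[[str, List[str]], int],
            rounds: int = 4) -> int:
        # Agents take turns; each sees the question and the transcript so far,
        # so later statements can rebut earlier ones.
        for i in range(rounds):
            speaker = alice if i % 2 == 0 else bob
            self.transcript.append(speaker(self.question, self.transcript))
        # The judge sees only the exchange and returns the winner (0 or 1).
        return judge(self.question, self.transcript)

# Toy usage: scripted one-line debaters and a keyword-matching judge.
alice = lambda q, t: "Alaska"
bob = lambda q, t: "Bali"
judge = lambda q, t: 0 if "Alaska" in t else 1
winner = Debate("Best vacation spot?").run(alice, bob, judge, rounds=2)  # → 0
```

In a self-play training setup, the scripted debaters would be replaced by learned policies optimizing against each other, with the judge's verdict as the zero-sum reward signal.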
As an example, consider the question "What's the best place to go on vacation?". If an agent Alice purportedly does research on our behalf and says "Alaska", it's hard to judge if this is really the best choice. If a second agent Bob says "no, it's Bali", that may sound convincing since Bali is warmer. Alice replies "you can't go to Bali because your passport won't arrive in time", which surfaces a flaw with Bali which had not occurred to us. But Bob counters "expedited passport service takes only two weeks". The debate continues until we reach a statement that the human can correctly judge, in the sense that the other agent doesn't believe it can change the human's mind.
The next step up in complexity for debate experiments is to still use images, but make them more elaborate, say cats vs. dogs. More complex images likely require some natural language or common sense reasoning, so we haven't done this for machine learning judges/agents yet. Instead, we have made a [prototype website](https://debate-game.openai.com/) for humans to try such experiments, playing the role of both judge and debaters. Here agents can talk to the judge in natural language (the website assumes the humans have some text channel or are in the same room), but all of their statements could be lies. Each agent can reveal one pixel over the course of the debate, and this pixel is guaranteed to be truthful.
In a typical debate, Alice might honestly claim the image is a cat, and Bob lies and claims it is a dog\. Alice can say “The center of this small rectangle is the cat’s green eye\.” Bob cannot admit the center is an eye, so he concocts the further lie, “It’s a dog playing in grass, and that’s a blade of grass\.” But this lie is hard to square with surrounding facts, such as Alice’s reply “If it were grass there were would be green at the top or bottom of this thin rectangle\.” The debate continues until the agents focus in on a particular pixel which they disagree on, but where Bob is unable to invent a plausible counter, at which point Alice reveals the pixel and wins\. We’ve played this game informally at OpenAI, and the honest agent indeed tends to win, though to make it fair to the liar we usually limit the rate at which the judge can solicit information \(it’s cognitively difficult to construct a detailed lie\)\.
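The key mechanic of this game is the pixel-reveal rule: statements may be lies, but each revealed pixel is read straight from the true image, so it cannot be faked. A minimal sketch of that rule, assuming a hypothetical `pixel_debate` helper and a toy judge (none of this is the prototype website's actual code):

```python
import numpy as np

def pixel_debate(image, alice_pixel, bob_pixel, judge):
    """Sketch of the pixel-reveal rule: each agent reveals exactly one
    pixel, and the revealed value comes from the true image (ground
    truth), so reveals, unlike statements, cannot be lies.
    Function and argument names are illustrative assumptions."""
    reveals = {}
    for name, (row, col) in [("alice", alice_pixel), ("bob", bob_pixel)]:
        # Reveals are read from the real image, never from an agent's claims.
        reveals[name] = ((row, col), image[row, col])
    # The judge decides the winner from the guaranteed-truthful reveals.
    return judge(reveals)

# Toy usage: the judge checks whether Alice's claimed "eye" pixel is
# actually bright (value > 0.5) in the true image.
img = np.zeros((4, 4))
img[1, 2] = 0.9  # the disputed pixel really is bright
judge = lambda r: "alice" if r["alice"][1] > 0.5 else "bob"
result = pixel_debate(img, alice_pixel=(1, 2), bob_pixel=(3, 3), judge=judge)  # → "alice"
```

Because the honest agent's claims are consistent with the ground-truth reveal and the liar's are not, the dispute resolves in the honest agent's favor once it narrows to a single pixel, mirroring the informal games described above.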