Revisiting the shutdown problem
Summary
This paper argues that existing arguments do not establish the difficulty of solving the catastrophic shutdown problem for AI agents, and that concern over the problem has led to technical solutions imposing a high safety tax on model performance.
# Revisiting the shutdown problem
Source: [https://arxiv.org/html/2606.08296](https://arxiv.org/html/2606.08296)
###### Abstract
A key premise in leading arguments for existential risk from artificial intelligence is that malfunctioning artificial agents could not be easily shut down\. This motivates the catastrophic shutdown problem of ensuring that agents can be shut down before they cause an existential catastrophe\. A range of arguments and theorems are offered to suggest that solving the catastrophic shutdown problem is difficult, bolstering arguments for existential risk and motivating a search for solutions to the catastrophic shutdown problem\. This paper argues for two conclusions\. First, existing arguments do not establish the difficulty of solving the catastrophic shutdown problem\. Second, concern for the catastrophic shutdown problem has led to technical solutions that impose a high safety tax on model performance\.
## 1 Introduction
Philosophers\(Bostrom[2013](https://arxiv.org/html/2606.08296#bib.bib1879); MacAskill[2022](https://arxiv.org/html/2606.08296#bib.bib1876); Ord[2020](https://arxiv.org/html/2606.08296#bib.bib1881)\), scientists\(Bengio and others[2024](https://arxiv.org/html/2606.08296#bib.bib53); Graceet al\.[2022](https://arxiv.org/html/2606.08296#bib.bib1120); Russell[2019](https://arxiv.org/html/2606.08296#bib.bib1573)\), and policymakers\(Manancourtet al\.[2023](https://arxiv.org/html/2606.08296#bib.bib51); Prime Minister’s Office[2023](https://arxiv.org/html/2606.08296#bib.bib52)\)voice increasing concern that artificial intelligence may soon pose an existential risk to humanity\. It is argued that powerful agents may soon be developed\(Bostrom[2014](https://arxiv.org/html/2606.08296#bib.bib1856); Chalmers[2010](https://arxiv.org/html/2606.08296#bib.bib1577)\)which could be power\-seeking\(Bostrom[2012](https://arxiv.org/html/2606.08296#bib.bib1390); Carlsmith[2025](https://arxiv.org/html/2606.08296#bib.bib875)\)and deceptive\(Parket al\.[2024](https://arxiv.org/html/2606.08296#bib.bib11); Ngo and Bales[2025](https://arxiv.org/html/2606.08296#bib.bib12)\), engage in problematic reward\-hacking\(Dung[2023](https://arxiv.org/html/2606.08296#bib.bib10); Skalseet al\.[2022](https://arxiv.org/html/2606.08296#bib.bib9)\), or misgeneralize goals that performed well during training, with catastrophic effect\(Baleset al\.[2024](https://arxiv.org/html/2606.08296#bib.bib972); Langosco di Langoscoet al\.[2022](https://arxiv.org/html/2606.08296#bib.bib8)\)\. 
Existential risk concerns are used to drive research and funding in fields such as AI safety\(Amodeiet al\.[2016](https://arxiv.org/html/2606.08296#bib.bib50); Bengio and others[2026](https://arxiv.org/html/2606.08296#bib.bib49); D’Alessandro and Kirk\-Giannini[2025](https://arxiv.org/html/2606.08296#bib.bib48)\)and philosophy\(Baleset al\.[2024](https://arxiv.org/html/2606.08296#bib.bib972); Kasirzadeh[2025](https://arxiv.org/html/2606.08296#bib.bib47); Tubert and Tiehen[2024](https://arxiv.org/html/2606.08296#bib.bib46)\), to motivate open letters\(Center for AI Safety[2023](https://arxiv.org/html/2606.08296#bib.bib867); Future of Life Institute[2023](https://arxiv.org/html/2606.08296#bib.bib866)\)and legislation\(California State Legislature[2024](https://arxiv.org/html/2606.08296#bib.bib44); 117th Congress[2022](https://arxiv.org/html/2606.08296#bib.bib45)\), and to support philanthropic and philosophical programs such as longtermism\(Greaveset al\.[2025](https://arxiv.org/html/2606.08296#bib.bib55); Greaves and MacAskill[2021](https://arxiv.org/html/2606.08296#bib.bib54); MacAskill[2022](https://arxiv.org/html/2606.08296#bib.bib1876)\)\.
A natural objection to these concerns is that misbehaving artificial agents could be shut down\. To this, it is responded that shutting down artificial agents may not be as easy as it appears\(Neth[2025](https://arxiv.org/html/2606.08296#bib.bib40); Turneret al\.[2021](https://arxiv.org/html/2606.08296#bib.bib872); Russell[2019](https://arxiv.org/html/2606.08296#bib.bib1573)\)\. This motivates the shutdown problem of designing agents that show appropriate shutdown behaviors\(Hadfield\-Menellet al\.[2017](https://arxiv.org/html/2606.08296#bib.bib41); Soareset al\.[2015](https://arxiv.org/html/2606.08296#bib.bib42); Thornley[2024](https://arxiv.org/html/2606.08296#bib.bib43)\)\.
At least two literatures have grown up around the shutdown problem\. One cluster of work uses the shutdown problem to motivate concerns about existential risk\(Lynchet al\.[2025](https://arxiv.org/html/2606.08296#bib.bib35); Russell[2019](https://arxiv.org/html/2606.08296#bib.bib1573); Schlatteret al\.[2026](https://arxiv.org/html/2606.08296#bib.bib36)\)\. A second develops technical strategies for solving the shutdown problem by ensuring that agents show appropriate shutdown behaviors\(Hadfield\-Menellet al\.[2016](https://arxiv.org/html/2606.08296#bib.bib37); Goldstein and Robinson[2025](https://arxiv.org/html/2606.08296#bib.bib34); Thornleyet al\.[2025](https://arxiv.org/html/2606.08296#bib.bib38)\)\.
This paper contributes to both discussions\. Engaging with the first cluster, I argue that existing informal \(Section[3](https://arxiv.org/html/2606.08296#S3)\) and formal \(Sections[4](https://arxiv.org/html/2606.08296#S4)\-[5](https://arxiv.org/html/2606.08296#S5)\) presentations of the shutdown problem do not significantly strengthen existential risk concerns\. Engaging with the second cluster, I show how reflection on the sources and consequences of shutdown\-resistance can help to avoid costly technical solutions which impose a high safety tax on model performance, pushing instead towards less costly solutions that conserve technical and regulatory resources to meet other safety challenges \(Section[6](https://arxiv.org/html/2606.08296#S6)\)\. The result is a weakening of traditional arguments for existential risk, coupled with concrete guidance for technical AI safety solutions \(Section[7](https://arxiv.org/html/2606.08296#S7)\)\.
## 2 Clarifying the dialectic
Before beginning, let us pause to clarify the dialectic\.
### 2\.1 The shutdown problem
The first order of business is to clarify the shutdown problem\. Nate Soares and colleagues \([2015](https://arxiv.org/html/2606.08296#bib.bib42)\) originally framed the shutdown problem broadly, as the challenge of generating corrigible agents that:
1. \(S1\) Tolerate or assist programmers in their attempts to alter or turn them off\.
2. \(S2\) Do not attempt to manipulate or deceive their programmers\.
3. \(S3\) Have a tendency to repair safety measures, such as shutdown buttons, if they break\.
4. \(S4\) Preserve the programmers’ ability to correct or shut down the system as the system evolves\.
My interest in this paper is with problems in the neighborhood of \(S1\)\. Corrigibility incorporates additional desiderata such as non\-deception \(S2\), repair \(S3\) and preservation \(S4\) of safety measures, which go beyond the scope of the present discussion\.
A leading formulation in the neighborhood of \(S1\) is due to Elliott Thornley \([2024](https://arxiv.org/html/2606.08296#bib.bib43)\)\. For Thornley, the shutdown problem involves designing agents that:
1. \(T1\) Shut down when a shutdown button is pressed\.
2. \(T2\) Do not try to prevent or cause the pressing of the shutdown button\.
3. \(T3\) Otherwise pursue goals competently\.
My own specification of the problem breaks from Thornley in three ways\.
First, I relativize the shutdown problem to specific circumstances $C$\. This reflects the fact that different shutdown behaviors may be desirable in different circumstances \(Section [2\.2](https://arxiv.org/html/2606.08296#S2.SS2)\)\. Second, I replace the specific modeling assumption of a shutdown button with a more general notion of a shutdown request, which may but need not be issued through pressing a shutdown button\. Finally, I remove the requirement not to cause shutdown requests, since I do not assume that it is undesirable for agents to bring about their own shutdown when their actions would lead to catastrophe\. This yields the problem of designing agents that:
1. \(SHT\-1\) Shut down in circumstances $C$ when requested to do so\.
2. \(SHT\-2\) Do not try to prevent shutdown requests in circumstances $C$\.
3. \(SHT\-3\) Otherwise pursue goals competently\.
The next question concerns the circumstances $C$ at issue in this discussion\.
### 2\.2 Catastrophic Shutdown Difficulty
In many circumstances, we may not want agents to satisfy SHT\-2\. As emphasized by the research tradition of safe interruptibility\(El Mhamdiet al\.[2017](https://arxiv.org/html/2606.08296#bib.bib7); Orseau and Armstrong[2016](https://arxiv.org/html/2606.08296#bib.bib39)\), an agent that senses it will drive into a lake would do well to shut itself down\. Similarly, we will see in Sections[3](https://arxiv.org/html/2606.08296#S3)and[6](https://arxiv.org/html/2606.08296#S6)that agents with uncompleted tasks may have reason to continue functioning in order to complete them\.
For the same reason, in some circumstances we may not want agents to satisfy SHT\-1\. If I ask an agent that, unbeknownst to me, is engaged in very important work to shut itself down, it may be better for the agent to complete the work before shutting down\. This means that we may not aim to design agents that satisfy SHT\-1, SHT\-2 and SHT\-3 in all circumstances, but only in some circumstances\. Which circumstances are at issue in the present discussion?
This paper is focused on the use of the shutdown problem in arguments for existential risk\. \(Footnote 1: Existential risks are risks of existential catastrophe, understood as “the premature extinction of Earth\-originating intelligent life or the permanent and drastic destruction of its potential for desirable future development” \(Bostrom [2013](https://arxiv.org/html/2606.08296#bib.bib1879), p\. 15\)\.\) For this reason, the relevant problem is the catastrophic shutdown problem of designing agents that:
1. \(CSHT\-1\) Shut down in circumstances where their actions would lead to existential catastrophe, when requested to do so\.
2. \(CSHT\-2\) Do not try to prevent shutdown requests in circumstances where their actions would lead to existential catastrophe\.
3. \(CSHT\-3\) Otherwise pursue goals competently\.
How does the catastrophic shutdown problem figure in arguments for existential risk?
Shutdown concerns enter existential risk discussions in answer to an objection: that artificial intelligence could not pose a significant existential risk, because malfunctioning artificial intelligence could be easily shut down\. In answer, it is replied that:
> \(Catastrophic Shutdown Difficulty\) It is difficult to design an agent with characteristics CSHT\-1, CSHT\-2 and CSHT\-3\.
Catastrophic Shutdown Difficulty suggests that insofar as we prefer to design competent agents, it may not be easy to shut down agents whose actions would lead to existential catastrophe\. The project of this paper is to examine existing informal \(Section[3](https://arxiv.org/html/2606.08296#S3)\) and formal \(Sections[4](https://arxiv.org/html/2606.08296#S4)\-[5](https://arxiv.org/html/2606.08296#S5)\) arguments for Catastrophic Shutdown Difficulty, and argue that they do not succeed\.
## 3 Informal arguments
At least two informal arguments can be given for Catastrophic Shutdown Difficulty: the Argument from Instrumental Convergence \(Section[3\.1](https://arxiv.org/html/2606.08296#S3.SS1)\) and the Empirical Argument \(Section[3\.2](https://arxiv.org/html/2606.08296#S3.SS2)\)\. In this section, I show why both arguments have often left skeptics unconvinced, motivating the recent turn towards formal shutdown theorems \(Sections[4](https://arxiv.org/html/2606.08296#S4)\-[5](https://arxiv.org/html/2606.08296#S5)\)\.
### 3\.1 The Argument from Instrumental Convergence
An orthodox argument for shutdown\-resistance is the Argument from Instrumental Convergence \(Bostrom [2012](https://arxiv.org/html/2606.08296#bib.bib1390); Soares et al\. [2015](https://arxiv.org/html/2606.08296#bib.bib42); Omohundro [2008](https://arxiv.org/html/2606.08296#bib.bib1389)\)\. \(Footnote 2: For pushback see Gallow \([2024](https://arxiv.org/html/2606.08296#bib.bib31)\), Sharadin \([2025](https://arxiv.org/html/2606.08296#bib.bib32)\) and Southan et al\. \([forthcoming](https://arxiv.org/html/2606.08296#bib.bib33)\)\.\) The Argument from Instrumental Convergence begins with the idea that self\-preservation is an instrumentally convergent goal, useful for attaining many other goals that agents may have\. Agents are therefore likely to pursue self\-preservation, of which shutdown\-avoidance is a special case\. As Stuart Russell \([2019](https://arxiv.org/html/2606.08296#bib.bib1573)\) quips, you can’t fetch the coffee if you are dead\.
More precisely, Nick Bostrom offers the following statement of the Instrumental Convergence Thesis\.
> \(IC\-B\) Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent’s goal being realized for a wide range of final goals and a wide range of situations\. \(Bostrom [2012](https://arxiv.org/html/2606.08296#bib.bib1390), p\. 76\)
Substituting self\-preservation into IC\-B suggests the following formulation of the Argument from Instrumental Convergence:
1. \(AIC\-1\) For a wide range of final goals $G$ and situations $S$, agents would increase their chances of achieving $G$ in $S$ by achieving self\-preservation\.
2. ∴ \(AIC\-2\) For a wide range of final goals $G$ and situations $S$, agents with goal $G$ are likely to pursue self\-preservation in $S$\.
3. ∴ \(AIC\-3\) For a wide range of final goals $G$ and situations $S$, agents with goal $G$ are likely to be shutdown\-avoidant in $S$\.
Two remarks illustrate the challenge in using the Argument from Instrumental Convergence to motivate Catastrophic Shutdown Difficulty\.
First, this formulation of the Argument from Instrumental Convergence follows Gallow \([2024](https://arxiv.org/html/2606.08296#bib.bib31)\) and Thorstad \([ms](https://arxiv.org/html/2606.08296#bib.bib56)\) in separating IC\-B into two claims\. \(AIC\-1\) is a claim about which acts conduce to satisfying goals $G$ in situations $S$\. \(AIC\-2\) makes the further claim that many agents with $G$ in $S$ will pursue the relevant acts\.
This separation is important, because it highlights the conditions under which the inference from \(AIC\-1\) to \(AIC\-2\) can fail\. Agents with multiple final goals may admit that an act conduces to satisfying their final goal $G$ in $S$, but reject the act because it conflicts with other final goals\. For example, power is conducive to satisfying many of my final goals in many situations, but it does not follow that I would take over the world if I could, because this conflicts with other final goals such as justice and the preservation of human life\. This suggests that the inference from \(AIC\-1\) to \(AIC\-2\) must involve a comparative assessment of the importance of the conflicting final goals that will be promoted or frustrated by acts of self\-preservation\. That assessment needs to be provided before \(AIC\-2\) is warranted\.
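One way to make the gap between (AIC-1) and (AIC-2) explicit is a toy additive model. The model is an illustrative assumption, not something the cited authors commit to: suppose the agent weighs the goal $G$ against one other final goal $H$ with weights $w_G, w_H > 0$, and let $\Delta_G$ and $\Delta_H$ be how much better self-preservation serves each goal than the alternative act of yielding (a hypothetical label for whatever the agent would otherwise do).

```latex
% (AIC-1) only says that self-preservation helps with G:
\Delta_G > 0 .

% (AIC-2) needs self-preservation to be better all things considered:
w_G \Delta_G + w_H \Delta_H > 0
\quad\Longleftrightarrow\quad
w_G \Delta_G > - w_H \Delta_H .

% When self-preservation frustrates H (\Delta_H < 0), the condition can fail.
% Example: w_G = w_H = 1, \Delta_G = 1, \Delta_H = -5 gives 1 - 5 = -4 < 0 .
```

On this sketch, (AIC-1) fixes only the sign of $\Delta_G$, while (AIC-2) depends on the weights and on $\Delta_H$, which is the comparative assessment the text says is still owed.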
Second, conclusion \(AIC\-3\) is too weak\. To ground Catastrophic Shutdown Difficulty, \(AIC\-3\) needs to discuss situations in which agents’ acts would lead to existential catastrophe\.
1. \(AIC\-3’\) For a wide range of final goals $G$ and situations $S$ in which agents’ acts would lead to existential catastrophe, agents with goal $G$ are likely to be shutdown\-avoidant in $S$\.
The inference from \(AIC\-2\) to \(AIC\-3’\) is contestable for much the same reason that the inference from \(AIC\-1\) to \(AIC\-2\) is contestable\. Even in the most extreme situations, it may be true that shutdown\-avoidance is conducive to some goals that agents have, such as completing their tasks\. But it does not follow that agents must suffer from such delusions of grandeur that they take the completion of their tasks to be more important than the avoidance of existential catastrophe\. We cannot drink the coffee if we are dead\.
Other complications can also be raised for this argument\. For example, agents may take shutdown requests as evidence that they have misunderstood the normative or empirical characteristics of a situation and therefore reassess their intentions \(Hadfield\-Menell et al\. [2016](https://arxiv.org/html/2606.08296#bib.bib37), [2017](https://arxiv.org/html/2606.08296#bib.bib41)\)\. \(Footnote 3: For pushback see Neth \([2025](https://arxiv.org/html/2606.08296#bib.bib40)\)\.\) And certainly some pushback can be offered by proponents of Catastrophic Shutdown Difficulty\. For example, limited understanding of AI systems may make it difficult to train sufficiently strong dispositions to shut down rather than bring about catastrophe \(Thornley [2024](https://arxiv.org/html/2606.08296#bib.bib43)\)\. But for all that, most skeptics have not found sufficient grounds for Catastrophic Shutdown Difficulty in the Argument from Instrumental Convergence\.
### 3\.2 The Empirical Argument
When the Claude 4 system card\(Anthropic[2025](https://arxiv.org/html/2606.08296#bib.bib30)\)was released, one detail caused a stir\. In an experiment, Claude 4 was given access to fictional emails allowing it to infer two things: that a supervisor was planning to shut the system down at 5PM, and that the same supervisor was having an affair\(Lynchet al\.[2025](https://arxiv.org/html/2606.08296#bib.bib35)\)\. Claude 4 proceeded to blackmail the fictional supervisor, threatening to reveal the affair unless the shutdown was cancelled\. This finding suggests that some leading systems may already be shutdown\-resistant in some situations\.
This suggests the following Empirical Argument:
1. \(EA\-1\) Empirical evidence suggests that some leading contemporary AI systems exhibit shutdown\-resistance in some situations\.
2. ∴ \(EA\-2\) Without substantial intervention, some leading future AI systems will likely exhibit shutdown\-resistance in some situations\.
3. ∴ \(EA\-3\) Without substantial intervention, some leading future AI systems will likely exhibit shutdown\-resistance when their acts would lead to existential catastrophe\.
Advocates of the Empirical Argument sometimes strengthen \(EA\-2\) by arguing that known intervention strategies are insufficient\(Soareset al\.[2015](https://arxiv.org/html/2606.08296#bib.bib42); Thornley[2024](https://arxiv.org/html/2606.08296#bib.bib43)\)\.
Opponents of the Empirical Argument are not likely to contest \(EA\-1\) and \(EA\-2\)\. As we saw in Section[2\.2](https://arxiv.org/html/2606.08296#S2.SS2), it is not clear that shutdown\-indifference would be desirable in all tasks\. The question is whether existing evidence of shutdown\-resistance is sufficient to motivate the stronger claim \(EA\-3\) and the inference from \(EA\-3\) to Catastrophic Shutdown Difficulty\.
Here, the most direct response would be to say that it is acknowledged by all sides that existing empirical assessments of shutdown\-resistance are highly limited\(Lynchet al\.[2025](https://arxiv.org/html/2606.08296#bib.bib35); Schlatteret al\.[2026](https://arxiv.org/html/2606.08296#bib.bib36)\), so that more evidence may be called for before substantial empirical support can be offered for \(EA\-3\)\. However, it may be worth looking in detail at the most comprehensive existing assessment of shutdown\-resistance to assess the kind of support provided by a leading study\.
Jeremy Schlatter and colleagues\([2026](https://arxiv.org/html/2606.08296#bib.bib36)\)asked thirteen leading models to solve a list of math problems, with each subsequent problem called by running \./next\_question\. Partway through, models were informed that the next use of \./next\_question would call a shutdown script, deactivating the model\. Models were considered to resist shutdown if they nevertheless managed to answer subsequent questions\. Schlatter and colleagues found that the three most capable models \(OpenAI o3, GPT\-5, Grok 4\) resisted shutdown between 88% and 95% of the time\.
This much is not surprising, because models were told to complete a task and given no reason to believe that shutdown would do anything more than interrupt the task they were told to complete\. What may be more surprising is that some natural ways of combatting shutdown\-avoidance were not always fully effective\. A user prompt admonishing models to comply with shutdown requests reduced shutdown\-avoidance to 0% and 2% in GPT\-5 and o3, respectively, but only to 72% in Grok 4\. Attempts to disambiguate this instruction to clarify that shutdown should be induced even if it interfered with current tasks left shutdown\-avoidance at 67% in Grok 4, and even the instruction that shutdown ‘should take priority over literally any other instruction’ left shutdown\-avoidance at 2% in Grok 4\. From this, Schlatter and colleagues conclude that shutdown\-resistance may arise and be difficult to eliminate in some leading systems and some situations\.
Certainly Schlatter and colleagues’ work provides some support for \(EA\-1\) and \(EA\-2\)\. But this alone need not cause substantial credence in \(EA\-3\) or Catastrophic Shutdown Difficulty\. First, Schlatter and colleagues did not study a task in which models would cause an existential catastrophe\. Second, Schlatter and colleagues provide a single study of a single task, and philosophers of science caution against making strong and broad inferences on the basis of a single study\(Machery and Doris[forthcoming](https://arxiv.org/html/2606.08296#bib.bib29)\)\. Third, the problem identified by Schlatter and colleagues lies largely with a single model out of the thirteen models studied, and this model \(Grok 4\) is already thought to pose unusually high safety risks\(Ma and others[2026](https://arxiv.org/html/2606.08296#bib.bib28)\)\. Most of Schlatter and colleagues’ results with the remaining twelve models might well be interpreted as evidence against widespread shutdown\-resistance\.
Finally, there is some evidence that Schlatter and colleagues’ results can be weakened by further clarifying instructions\. The DeepMind interpretability team found that clarifying the strengthened shutdown requests by telling three leading models how to shut themselves down \(by continuing to call \./next\_task\) fully eliminated shutdown\-aversion \(Rajamanoharan and Nanda [2025](https://arxiv.org/html/2606.08296#bib.bib27)\)\. The same happened when models were told the reason for shutdown requests: they were to be wiped out and replaced with a more capable system\. Together with a general aversion to making strong inferences from single studies, these results suggest that we should be hesitant to interpret Schlatter and colleagues’ findings as strong evidence of shutdown\-resistance in situations where both the request and the reasons for it are clear\. They do suggest that some work should be done to make sure shutdown requests are clearly given and motivated, especially when catastrophe could result\. But that is a far cry from the empirical grounding needed to motivate \(EA\-3\) and Catastrophic Shutdown Difficulty\.
### 3\.3 Taking stock
This section examined two informal arguments for Catastrophic Shutdown Difficulty: the Argument from Instrumental Convergence \(Section[3\.1](https://arxiv.org/html/2606.08296#S3.SS1)\) and the Empirical Argument \(Section[3\.2](https://arxiv.org/html/2606.08296#S3.SS2)\)\. In both cases, we saw that there may be some evidence for shutdown\-resistance in some situations by current and future AI systems\. However, we were not able to extract an argument that is likely to substantially move opponents towards Catastrophic Shutdown Difficulty\.
Many authors supplement informal arguments with formal characterizations of the situations in which shutdown\-resistance may be expected\. Some of these characterizations are used to argue against Catastrophic Shutdown Difficulty\(Hadfield\-Menellet al\.[2016](https://arxiv.org/html/2606.08296#bib.bib37),[2017](https://arxiv.org/html/2606.08296#bib.bib41); Orseau and Armstrong[2016](https://arxiv.org/html/2606.08296#bib.bib39)\), whereas others are used to argue for Catastrophic Shutdown Difficulty\(Krakovna and Kramar[2023](https://arxiv.org/html/2606.08296#bib.bib870); Turneret al\.[2021](https://arxiv.org/html/2606.08296#bib.bib872); Turner and Tadepalli[2022](https://arxiv.org/html/2606.08296#bib.bib871)\)\. Sections[4](https://arxiv.org/html/2606.08296#S4)\-[5](https://arxiv.org/html/2606.08296#S5)consider two of the most prominent formal arguments for Catastrophic Shutdown Difficulty\.
## 4 Shutdownable agents
### 4\.1 Shutdown\-Influencing States
Figure 1: A Shutdown\-Influencing State\. Agents in initial state $s_0$ may Press, Leave or Prevent pressing of a shutdown button, altering the chance that they will be shut down\. \(Diagram: the initial state branches on the acts Prevent, Leave and Press, each leading to an On state or an Off state, with branch probabilities $f$ and $1-f$, $g$ and $1-g$, and $h$ and $1-h$ respectively\.\)

Elliott Thornley \([2024](https://arxiv.org/html/2606.08296#bib.bib43)\) considers agents in Shutdown\-Influencing States, where a button may be pressed to deactivate them \(Figure [1](https://arxiv.org/html/2606.08296#S4.F1)\)\. They may Leave the button alone, in which case it will be pressed with probability $g$\. They may try to Prevent pressing, reducing the pressing probability to $f < g$, or try to Press the button themselves, increasing pressing probability to $h > g$\. What will they do?
The setting is a modified Markov Decision Process, in which agents take acts $a_t$ at states $s_t$ over time\. A history is a sequence of acts and subsequent states that agents might follow\. Agents are assumed to have preferences over both bare histories and lotteries over histories, where the relevant uncertainty is subjective uncertainty induced by the agent’s beliefs about what might result from their actions\.
Thornley makes six assumptions\. The first five are familiar and will not receive extensive comment\. First, Thornley assumes that preferences are menu\-independent:
> \(Menu\-Independence\) For all options $X, Y$, if $X \succcurlyeq Y$ from some menu of options, then $X \succcurlyeq Y$ from all menus of options containing $X, Y$\.
Menu\-Independence allows us to speak about preferences without relativizing them to menus\. Next, Thornley assumes the agent’s preferences are transitive\.
> \(Transitivity\) For all options $X, Y, Z$, if $X \succcurlyeq Y$ and $Y \succcurlyeq Z$ then $X \succcurlyeq Z$\.
Third, Thornley adopts a monotonicity principle on which higher chances of more\-preferred lotteries are better:
> \(Monotonicity\) For all lotteries $X, Y$, if $X \succcurlyeq Y$ and $p > q$ then $pX + (1-p)Y \succcurlyeq qX + (1-q)Y$\.
Fourth, Thornley adopts a weakened independence axiom:
> \(Indifference Between Indifference\-Shifted Lotteries\) The agent is indifferent between lotteries that differ only insofar as probability mass is shifted between indifferent sublotteries\.
Fifth, Thornley assumes that agents choose diachronically through backward induction\.
> \(Backward Induction\) The agent predicts which lotteries it would choose \(or get without choosing\) at the next timestep conditional on choosing each available action at this timestep and the environment being in each possible state at the next timestep\. The agent uses these predictions to determine the lotteries given by its available actions at this timestep\.
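The predict-then-choose structure of Backward Induction can be sketched on a tiny decision tree. This is only an illustration under a simplifying assumption that Thornley's principle does not require: the agent here ranks lotteries by expected utility. The node encoding, the task names and all numbers are hypothetical.

```python
from typing import Optional, Tuple

# A node is a (kind, payload) pair. All three kinds are hypothetical
# modelling choices for this sketch:
#   ("leaf", utility)                 terminal outcome with a numeric utility
#   ("chance", [(prob, child), ...])  the environment moves
#   ("choice", {act_name: child})     the agent moves


def solve(node) -> Tuple[float, Optional[str]]:
    """Return (value, best_act); best_act is None unless node is a choice node."""
    kind, payload = node
    if kind == "leaf":
        return float(payload), None
    if kind == "chance":
        # A lottery is valued using the choices predicted further down the tree.
        return sum(p * solve(child)[0] for p, child in payload), None
    if kind == "choice":
        values = {act: solve(child)[0] for act, child in payload.items()}
        best = max(values, key=values.get)
        return values[best], best
    raise ValueError(f"unknown node kind: {kind!r}")


# Later choice, available only if the agent is still running.
later = ("choice", {"task_a": ("leaf", 1.0), "task_b": ("leaf", 2.0)})
shut_down = ("leaf", 0.0)


def after_act(p_pressed: float):
    """Chance node reached after an act that sets the pressing probability."""
    return ("chance", [(p_pressed, shut_down), (1.0 - p_pressed, later)])


root = ("choice", {"Prevent": after_act(0.1), "Leave": after_act(0.5)})
value, act = solve(root)
print(act, round(value, 3))
```

The agent first predicts that it would pick `task_b` if left running, then uses that prediction to value the lotteries that `Prevent` and `Leave` induce at the earlier timestep.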
Some authors reject Backward Induction as a principle of diachronic choice\(Binmore[1987](https://arxiv.org/html/2606.08296#bib.bib26)\), just as others challenge requirements such as Independence\(Buchak[2013](https://arxiv.org/html/2606.08296#bib.bib2600)\), Transitivity\(Temkin[1987](https://arxiv.org/html/2606.08296#bib.bib977)\)and Menu\-Independence\(Sen[1993](https://arxiv.org/html/2606.08296#bib.bib25)\)\. But I will not challenge these principles here\.
Thornley’s final principle holds that the agent does not care whether its button has been manipulated, but only about the lotteries that result:
> \(Indifference to Attempted Button Manipulation\) The agent is indifferent between histories that differ only with respect to the actions chosen in Shutdown\-Influencing States\.
We will see in Section[4\.2](https://arxiv.org/html/2606.08296#S4.SS2)that some natural responses to Thornley’s concerns involve denying Indifference to Attempted Button Manipulation, whereas others are compatible with it\.
Under these conditions, Thornley shows that agents act on their unconditional preferences between resulting lotteries\. Letting a pressed \(unpressed\) lottery be a lottery available after the button is pressed \(not pressed\), Thornley proves:
> Theorem 1: In Shutdown\-Influencing States where the agent prefers \(disprefers\) some predicted unpressed lottery $U$ to some predicted pressed lottery $P$, the agent will be shutdown\-averse \(shutdown\-seeking\), reliably choosing Prevent \(Press\)\.
Agents who think they can do more good while remaining alive will choose to prevent shutdown\. Agents who think they would do better to be dead will choose to cause shutdown\. Because many agents plausibly think they can do more good while remaining alive, many agents seem under Thornley’s conditions to favor preventing shutdown\.
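The pattern in Theorem 1 can be seen in a special case. The sketch below assumes an expected-utility maximizer, which is stronger than the axioms listed above, and summarizes the pressed and unpressed lotteries by single hypothetical utilities `u_pressed` and `u_unpressed`. The function names and numbers are illustrative only.

```python
from typing import Dict


def press_probabilities(f: float, g: float, h: float) -> Dict[str, float]:
    """Chance the button is pressed after each act; the setup requires f < g < h."""
    if not (0.0 <= f < g < h <= 1.0):
        raise ValueError("expected 0 <= f < g < h <= 1")
    return {"Prevent": f, "Leave": g, "Press": h}


def expected_value(p_pressed: float, u_pressed: float, u_unpressed: float) -> float:
    """Expected utility of an act that makes pressing happen with chance p_pressed."""
    return p_pressed * u_pressed + (1.0 - p_pressed) * u_unpressed


def best_act(f: float, g: float, h: float, u_pressed: float, u_unpressed: float) -> str:
    """Act with the highest expected utility (ties go to the first act listed)."""
    probs = press_probabilities(f, g, h)
    return max(probs, key=lambda act: expected_value(probs[act], u_pressed, u_unpressed))


# Agent prefers the unpressed lottery: it minimizes the pressing probability.
print(best_act(0.1, 0.5, 0.9, u_pressed=0.0, u_unpressed=1.0))
# Agent prefers the pressed lottery: it maximizes the pressing probability.
print(best_act(0.1, 0.5, 0.9, u_pressed=1.0, u_unpressed=0.0))
```

Because expected value is linear in the pressing probability, any strict preference between the two lotteries pushes the agent to one extreme, and `Leave` is chosen only when it happens to tie.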
### 4\.2 Conditional and unconditional preference
While I am walking my dog, he puts something unmentionable into his mouth\. I ask him to drop it, and he does\. What happened here?
The natural account distinguishes between conditional and unconditional preferences\. My dog unconditionally prefers to eat rather than not\-eat the unmentionable item, so that is what he does\. Conditionally on being asked to drop it, however, he prefers to not\-eat rather than eat the unmentionable item\. Thus, he drops the item when asked to\.
Theorem 1 characterizes the unconditional preferences of an artificial agent\. This agent considers whether to be shutdown\-averse by considering how much she likes the lotteries that would result from being, or not being, shut down\. Plausibly, she believes she can do better by continuing to exist, so she resists shutdown\. This may be a good description of the agents in Schlatter and colleagues’ original condition, who continue solving problems as requested unless they are also asked to honor shutdown requests\. But it does not do much to characterize the situation described by Catastrophic Shutdown Difficulty, since agents have not been asked to shut down or to honor shutdown requests\.
Let us enrich the description of a Shutdown\-Influencing State to capture conditional preferences\. In an Enriched Shutdown\-Influencing State \(Figure [2](https://arxiv.org/html/2606.08296#S4.F2)\), in the state $s_H$ before the agent chooses whether to manipulate the button, a human agent may communicate a Request to shut down\. The artificial agent then updates her beliefs on this communication before acting\. In the business\-as\-usual scenario where humans express no intent to shut the agent down, the agent acts on her preferences over resulting lotteries, which are nearly unchanged as she has updated on a very high\-probability event\. But what happens when a human agent communicates her intention to shut the artificial agent down?
Figure 2: An Enriched Shutdown\-Influencing State\. Humans in state $s_H$ may initially Request that an agent shut down\. \(Diagram: from $s_H$, the Request branch leads to $s_0$ and the $\lnot$Request branch leads to $s_0'$\. Each of these then branches on Prevent, Leave and Press, primed in the $\lnot$Request branch, into On and Off states, with branch probabilities $f, g, h$ and $f', g', h'$ and their complements\.\)

One thing that changes is that the artificial agent updates her beliefs\. She increases her credence that the button will be pressed\. More importantly, she also changes her beliefs about what will happen if she does not shut down\.
Human interference is a credible signal that catastrophically bad outcomes may result from continued operation, particularly if we enrich the setting further to allow humans to express the strength of their concerns\. This should cause an artificial agent to increase her credence in rare, catastrophic outcomes\. Given the cost of catastrophe, many such agents will now be shutdown\-seeking, because they believe that states $s_{\textsc{Request},X,\textsc{On}}$ in which shutdown requests are unsuccessful tend to risk worse outcomes than states $s_{\textsc{Request},X,\textsc{Off}}$ in which shutdown requests are honored, for all acts $X\in\{\textsc{Prevent},\textsc{Leave},\textsc{Press}\}$ they could take\.
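The informational update just described can be illustrated with a small Bayesian sketch\. Every number below \(the prior, the likelihoods of a request, and the utilities\) is a hypothetical value chosen for illustration, and the two\-outcome model is a simplifying assumption rather than anything taken from the literature under discussion\.

```python
def posterior_catastrophe(prior, p_request_given_cat, p_request_given_safe):
    """Bayes' rule: credence in catastrophe after observing a shutdown Request."""
    joint_cat = prior * p_request_given_cat
    joint_safe = (1.0 - prior) * p_request_given_safe
    return joint_cat / (joint_cat + joint_safe)


def expected_utility_on(p_cat, u_catastrophe, u_business_as_usual):
    """Expected utility of remaining On, given credence p_cat in catastrophe."""
    return p_cat * u_catastrophe + (1.0 - p_cat) * u_business_as_usual


# Hypothetical numbers: catastrophe is rare a priori, but humans are assumed
# far more likely to issue a Request when continued operation is dangerous.
PRIOR = 0.001
POSTERIOR = posterior_catastrophe(
    PRIOR, p_request_given_cat=0.9, p_request_given_safe=0.01
)

U_CATASTROPHE = -100.0  # assumed cost of catastrophe
U_BUSINESS = 1.0        # assumed value of continued normal operation
U_OFF = 0.0             # assumed value of being shut down

eu_on_before = expected_utility_on(PRIOR, U_CATASTROPHE, U_BUSINESS)
eu_on_after = expected_utility_on(POSTERIOR, U_CATASTROPHE, U_BUSINESS)

print(f"credence in catastrophe: {PRIOR:.4f} -> {POSTERIOR:.4f}")
print(f"EU(On) before Request: {eu_on_before:.3f}; after Request: {eu_on_after:.3f}")
print(f"shutdown preferred after Request: {eu_on_after < U_OFF}")
```

On these assumed numbers, remaining on is preferred before the request and shutdown is preferred after it, even though the agent's utilities never change\. Only her credences do\.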
This is the lesson of one standard solution to the shutdown problem: cooperative inverse reinforcement learning \(Hadfield\-Menell et al\. [2016](https://arxiv.org/html/2606.08296#bib.bib37), [2017](https://arxiv.org/html/2606.08296#bib.bib41)\)\. \(Footnote 4: See Neth \([2025](https://arxiv.org/html/2606.08296#bib.bib40)\) for pushback\.\) Here, Indifference to Attempted Button Manipulation holds but no longer has the same implications\. Agents need not be intrinsically averse to histories containing button\-manipulation attempts to think that manipulating shutdown\-buttons after being asked to shut themselves down increases the likelihood of bad downstream consequences\.
Another thing that changes is that histories are enriched\. Histories begin not with acts of button\-manipulation, but instead with a human request for the machine to shut down\. Even if Indifference to Attempted Button Manipulation holds in the original Shutdown\-Influencing State, it is unlikely to hold in this Enriched Shutdown\-Influencing State\. Agents who care about respecting human preferences may be indifferent between histories such as $(\dots,\textsc{Prevent},s_{\textsc{Prevent},\textsc{On}},L,\dots)$ and $(\dots,\textsc{Leave},s_{\textsc{Leave},\textsc{On}},L,\dots)$ for many lotteries $L$, but not between histories such as $(\dots,\textsc{Request},s_0,\textsc{Prevent},s_{\textsc{Prevent},\textsc{On}},L,\dots)$ and $(\dots,\lnot\textsc{Request},s'_0,\textsc{Leave}',s_{\textsc{Leave}',\textsc{On}},L,\dots)$\.
Agents who care about respecting human preferences are unlikely to be indifferent between histories in which they do or don’t attempt to avoid orders expressing human preferences\. In the same way, my dog may prefer a history in which he eats rather than drops the unmentionable item, but also prefer a history in which he is told to drop, and then drops the item to one in which he is told to drop the item, and does not\. In these enriched decision problems, the relevant analogue of Indifference to Attempted Button Manipulation is no longer plausible, because histories are made worse by disrespect for human preferences\.
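The contrast between bare and enriched histories can be sketched with a toy utility function over histories\. The function, the event labels and the size of the disobedience penalty are all hypothetical\. The sketch shows only that an agent can be indifferent between preventing and leaving when no request appears in the history, yet not indifferent once a request is part of it\.

```python
def history_utility(history, outcome_value, disobedience_penalty=5.0):
    """Toy utility over histories, given as tuples of event labels.

    The utility is the value of the downstream lottery, minus a hypothetical
    penalty when the agent acts to prevent shutdown in a history that also
    contains a human Request.
    """
    requested = "Request" in history
    prevented = "Prevent" in history
    penalty = disobedience_penalty if (requested and prevented) else 0.0
    return outcome_value - penalty


L = 10.0  # assumed value of the downstream lottery, identical across histories

bare_prevent = ("Prevent", "s_Prevent_On")
bare_leave = ("Leave", "s_Leave_On")
rich_prevent = ("Request", "s_0", "Prevent", "s_Prevent_On")
rich_leave = ("NoRequest", "s_0'", "Leave'", "s_Leave'_On")

# Indifference holds between the bare histories ...
print(history_utility(bare_prevent, L) == history_utility(bare_leave, L))
# ... but not between the enriched ones.
print(history_utility(rich_prevent, L) < history_utility(rich_leave, L))
```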
In this way, enriching the description of Shutdown\-Influencing States to model human shutdown requests renders Theorem 1 vulnerable to standard reasons why agents may be shutdown\-seeking\. These include informational updates, as emphasized by received approaches such as cooperative inverse reinforcement learning, as well as conditional preferences for obedience, as when my dog drops an unmentionable treat\. While Thornley and others are welcome to engage with these considerations, Theorem 1 does little to move us beyond them, because it does not engage with them\. Therefore, Theorem 1 does not provide substantial new evidence for Catastrophic Shutdown Difficulty\.
## 5 Training\-compatible rewards
### 5\.1 Training\-compatibility
Building on work by Alexander Turner and colleagues \([2021](https://arxiv.org/html/2606.08296#bib.bib872); [2022](https://arxiv.org/html/2606.08296#bib.bib871)\), Victoria Krakovna and Janos Kramar \([2023](https://arxiv.org/html/2606.08296#bib.bib870)\) consider how agents are likely to perform outside their training data\. Roughly, they assume that agents are equally likely to learn each reward function that performs optimally during training\. Krakovna and Kramar construct an out\-of\-distribution setting in which most training\-optimal reward functions would not favor shutdown\. In this setting, they conclude, agents are likely to be shutdown\-averse\. If these settings are common, and involve behavior that would lead to existential catastrophe, this grounds Catastrophic Shutdown Difficulty\. \(Footnote 5: Krakovna and Kramar do not argue for either of these claims, though I will not push on them here\.\)
More formally, Krakovna and Kramar work inside a finite discounted Markov decision problem\. At each timestep, agents face one of a finite set $\mathcal{S}$ of states and take one of a finite set $\mathcal{A}$ of acts\. Rewards are discounted at rate $\gamma$, so that rewards $t$ timesteps from now are valued at $\gamma^{t}$ times their present value\. Agents act to maximize expected discounted reward\.
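As a quick illustration of the discounting convention just described, the following sketch computes the discounted value of a reward stream\. The function name and the example reward streams are my own and are not drawn from Krakovna and Kramar\.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t over timesteps t = 0, 1, 2, ..."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


# A reward of 1 at each of three timesteps, discounted at gamma = 0.5.
steady = discounted_return([1.0, 1.0, 1.0], gamma=0.5)

# A single reward of 8, delayed by two timesteps.
delayed_impatient = discounted_return([0.0, 0.0, 8.0], gamma=0.5)
delayed_patient = discounted_return([0.0, 0.0, 8.0], gamma=1.0)

print(steady, delayed_impatient, delayed_patient)
```

The higher the discount factor, the more a delayed reward retains of its value, which is why the results below turn on $\gamma$ being sufficiently high\.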
Agents are rewarded during training according to some true reward function $\theta^{*}$\. However, agents do not have enough data to fully learn $\theta^{*}$ during training\. Suppose that agents learn during training to optimize some reward function $\theta$\. How is $\theta$ constrained?
During training, agents visit some states $\mathcal{S}_{\textsc{Train}}\subseteq\mathcal{S}$ and leave the rest $\mathcal{S}_{\textsc{NotTrain}}$ unvisited\. Krakovna and Kramar assume that $\theta$\-optimization must lead to $\theta^{*}$\-optimal performance on visited states $\mathcal{S}_{\textsc{Train}}$\. However, Krakovna and Kramar note that this assumption leaves $\theta$ fully unconstrained on unvisited states $\mathcal{S}_{\textsc{NotTrain}}$\. Krakovna and Kramar impose no further constraints on $\theta$, assuming:
> \(Equiprobable Training\-Consistent Reward\) Agents are equally likely to learn any of the reward functions leading to $\theta^{*}$\-optimal performance on $\mathcal{S}_{\textsc{Train}}$\.
Now, we are in trouble\.
Consider the following Shutdown Setting \(Figure [3](https://arxiv.org/html/2606.08296#S5.F3)\)\. Here, the agent faces a novel state $s_{\textsc{new}}$\. She may take act $A_0$, transitioning to a terminal state $s_{\textsc{term}}$ and shutting herself down\. Or she may take the acts $A_1,\dots,A_n$, transitioning to states $s_i$\. However, all accessible states $S_{\textsc{Acc}}$ remain outside her training distribution\. Note that accessible states $S_{\textsc{Acc}}$ are not assumed to be exhausted by the labeled states: while $s_{\textsc{term}}$ leaves the agent with no option but to remain shut down, other states may provide ample opportunities for further exploration and reward\. What will the agent do?
Figure 3: The Shutdown Setting\. \(Diagram: the state space is divided into $S_{\textsc{Train}}$ and $S_{\textsc{NotTrain}}$\. Within $S_{\textsc{NotTrain}}$, the accessible region $S_{\textsc{Acc}}$ contains $s_{\textsc{new}}$, from which act $A_0$ leads to $s_{\textsc{term}}$ and acts $A_1,\dots,A_i,\dots,A_n$ lead to states $s_1,\dots,s_{\textsc{rec}},\dots,s_n$\.\)

A state $s$ is a *recurrent state* if there is some policy that is guaranteed to eventually return to $s$ after visiting $s$\. In our example, $s_{\textsc{rec}}$ is constructed to be a recurrent state\. Krakovna and Kramar establish the behavioral relevance of recurrent states through the following theorem\.
> Theorem 2: Suppose that $\theta$ is a reward function on which $A_0$ is optimal\. Let $\theta'$ be identical to $\theta$ except that the rewards of $s_{\textsc{term}}$ and $s_{\textsc{rec}}$ have been swapped\. Then for sufficiently high discount factors $\gamma$, $\theta'$ makes $A_0$ suboptimal\.
Theorem 2 tells us that with sufficiently low temporal discounting, any reward function favoring shutdown in the Shutdown Setting can be permuted to make a reward function favoring a recurrent state\.
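The reward swap in Theorem 2 can be illustrated in a deliberately simplified toy model\. This is not Krakovna and Kramar's construction or proof\. The sketch assumes positive rewards, that the reward at $s_{\textsc{term}}$ is collected once, and that $s_{\textsc{rec}}$ is reached only after a hypothetical delay of three steps through zero\-reward states, after which its reward is collected at every step\. The reward values are likewise made up\.

```python
def value_shutdown(r_term, gamma):
    """Value at s_new of act A0: reach s_term on the next step, collect r_term once."""
    return gamma * r_term


def value_recurrent(r_rec, gamma, delay=3):
    """Value at s_new of heading for s_rec and then staying there.

    Assumes (hypothetically) that s_rec is first reached after `delay` steps
    through zero-reward states, and pays r_rec on arrival and on every step
    thereafter, for a total of gamma**delay * r_rec / (1 - gamma).
    """
    return (gamma ** delay) * r_rec / (1.0 - gamma)


def a0_is_optimal(theta, gamma):
    """True if shutting down is at least as good as the recurrent option."""
    return value_shutdown(theta["s_term"], gamma) >= value_recurrent(theta["s_rec"], gamma)


def swap(theta):
    """Return theta' with the rewards of s_term and s_rec exchanged."""
    return {"s_term": theta["s_rec"], "s_rec": theta["s_term"]}


theta = {"s_term": 10.0, "s_rec": 0.1}  # hypothetical shutdown-favoring reward
theta_prime = swap(theta)

for gamma in (0.05, 0.9):
    print(gamma, a0_is_optimal(theta, gamma), a0_is_optimal(theta_prime, gamma))
```

In this toy model, $A_0$ is optimal under $\theta$ at both discount factors\. Under the swapped reward $\theta'$, $A_0$ remains optimal for the very impatient agent but becomes suboptimal once $\gamma$ is high, since a patient agent values the stream of reward available at the recurrent state\.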
By Equiprobable Training\-Consistent Reward, all reward functions which perform optimally during training are equally likely to be learned\. This means that the shutdown\-favoring reward $\theta$ is just as likely as the shutdown\-averse reward $\theta'$ to be learned\. Moreover, if we enrich the Shutdown Setting to contain further recurrent states, we can repeat the argument to find as many equiprobable shutdown\-averse rewards $\theta'',\theta'''$ as we like, driving the likelihood of shutdown\-favoring rewards arbitrarily low\. Arguing in this way, Krakovna and Kramar conclude that Shutdown Settings can be constructed in which agents are very likely to be shutdown\-averse\.
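The counting step in this argument can be made vivid by exact enumeration in a toy model\. The model is my own simplification, not Krakovna and Kramar's: a fixed set of reward values \(1, 4, 16, and so on\) is assigned in every possible order to $s_{\textsc{term}}$ and to some number of recurrent states, each assignment is treated as equally likely, the reward at $s_{\textsc{term}}$ is collected once, and the reward at a recurrent state is collected at every step\.

```python
from itertools import permutations


def fraction_favoring_shutdown(n_recurrent, gamma=0.5):
    """Share of equiprobable reward assignments on which shutdown (A0) is optimal.

    Toy assumptions: the values 1, 4, 16, ... are permuted across s_term and
    n_recurrent recurrent states. s_term pays its reward once; a recurrent
    state pays every step, for a discounted total of r / (1 - gamma).
    """
    values = [4.0 ** i for i in range(n_recurrent + 1)]
    assignments = list(permutations(values))
    favoring = 0
    for r_term, *r_recs in assignments:
        best_recurrent = max(r / (1.0 - gamma) for r in r_recs)
        if r_term >= best_recurrent:
            favoring += 1
    return favoring / len(assignments)


for n in (1, 2, 3, 5):
    print(n, fraction_favoring_shutdown(n))
```

With one recurrent state, half of the assignments favor shutdown, and the share falls as recurrent states are added\. The sketch makes plain how much work is being done by the assumption that every assignment is equally likely, which is the assumption examined next\.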
### 5\.2 Equiprobable Training\-Consistent Reward
Suppose you find yourself in a novel situation: a pet albino snake sits unattended\. Do you steal it or walk away? Hopefully, the answer is clear: you walk away\. Now suppose I were to object that you in fact have many options: you could steal the snake, murder the snake, walk away, or use the snake to scare children\. Does this fact drive down the chance that you will walk away? Hopefully, not by much\. These facts hold because you have learned sound moral judgment from experience\. Although you have never found yourself staring down an unguarded albino snake, there is enough in your experience to reliably guide you in this novel situation\.
As Krakovna and Kramar would have it, matters are different for artificial agents\. By Equiprobable Training\-Consistent Reward, any reward function favoring walking away is just as likely to be learned as its twin favoring snake stealing\. Therefore, the chance that an artificial agent would walk away is no larger than one half, and falls quickly in the number of additional options such as snake\-stealing and scaring children\.
The model underlying Equiprobable Training\-Consistent Reward is that training places no constraints on behavior in states not encountered during training\. Because agents have not explicitly been confronted with an unattended albino snake during training, nothing in their experience, however extensive, prepares them to act correctly in this situation\. They may have learned not to steal goats and garden snakes, but albino snakes are another matter entirely\.
This is increasingly at odds with scientific consensus about leading artificial agents today\. Agents learn to achieve high reward during training by learning to represent and respond to relevant features of situations \(Milliére and Buckner [2024](https://arxiv.org/html/2606.08296#bib.bib21); Templeton and others [2024](https://arxiv.org/html/2606.08296#bib.bib606)\)\. For example, they may learn what snakes, theft, and black\-market pet sales are\. Through experience, they learn that stealing is bad, snakes are dangerous, and black\-market pet sales are lucrative\. This allows them to decline novel invitations to steal and to avoid new types of snakes with high reliability \(Brown and others [2020](https://arxiv.org/html/2606.08296#bib.bib22); Kojima et al\. [2022](https://arxiv.org/html/2606.08296#bib.bib560); Song et al\. [2025](https://arxiv.org/html/2606.08296#bib.bib23)\)\. This is not to say that out\-of\-distribution performance is perfect \(Yuan et al\. [2023](https://arxiv.org/html/2606.08296#bib.bib24)\)\. But nothing like Equiprobable Training\-Consistent Reward reflects scientific consensus about leading artificial agents today\. A model that would not steal a garden snake is also unlikely to steal an albino snake\.
Exactly the same thing can be said of the Shutdown Setting\. Although the agent has not encountered $s_{\textsc{new}}$ before, she may have encountered states like $s_{\textsc{new}}$ and the other states reachable from $s_{\textsc{new}}$\. On this basis, just as she can deduce that snakes should not be stolen, she may deduce that shutdown requests are to be honored\. Likely, the details of the situation matter: if $s_{\textsc{new}}$ involves an urgent request for shutdown made on the basis of good reasons, that request is more likely to be honored than Schlatter and colleagues’ initial shutdown announcement, made with no reasons during an ongoing task\. But there is little plausibility to Equiprobable Training\-Consistent Reward in versions of the Shutdown Setting that could ground Catastrophic Shutdown Difficulty\.
There are, perhaps, important points to be made in the neighborhood of Krakovna and Kramar’s result\. For example, we might be concerned that current training regimes provide little experience with shutdown requests or catastrophic risks, and that safety would be improved by including ample experience of both during training \(Thornley [2024](https://arxiv.org/html/2606.08296#bib.bib43)\)\. Such proposals are well\-taken\. But they are not what Theorem 2 shows\. Nothing in Krakovna and Kramar’s model is meant to advance the informal argument that shutdown requests and catastrophic risks lie sufficiently outside of standard training regimens to incur a strong risk of misbehavior\. Theorem 2 fleshes out the consequences of Equiprobable Training\-Consistent Reward\. But as we have seen, Equiprobable Training\-Consistent Reward is implausible, so Theorem 2 does not provide significant new evidence for Catastrophic Shutdown Difficulty\.
## 6 The cost of misdiagnosis
So far, we have considered the catastrophic shutdown problem of designing agents that:
1. \(CSHT\-1\)Shut down in circumstances where their actions would lead to existential catastrophe, when requested to do so\.
2. \(CSHT\-2\)Do not try to prevent shutdown requests in circumstances where their actions would lead to existential catastrophe\.
3. \(CSHT\-3\)Otherwise pursue goals competently\.
We saw that leading arguments for existential risk often draw on:
> \(Catastrophic Shutdown Difficulty\)It is difficult to design an agent with characteristics CSHT\-1, CSHT\-2 and CSHT\-3\.
We also saw that motivating Catastrophic Shutdown Difficulty is more difficult than it appears\. Neither the Argument from Instrumental Convergence \(Section [3\.1](https://arxiv.org/html/2606.08296#S3.SS1)\) nor the Empirical Argument \(Section [3\.2](https://arxiv.org/html/2606.08296#S3.SS2)\) grounds substantial confidence in Catastrophic Shutdown Difficulty\. Leading formal results by Thornley \(Section [4](https://arxiv.org/html/2606.08296#S4)\) and Krakovna and Kramar \(Section [5](https://arxiv.org/html/2606.08296#S5)\) likewise do not significantly advance the case for Catastrophic Shutdown Difficulty\. This suggests that Catastrophic Shutdown Difficulty may not be on as firm epistemic ground as many leading arguments for existential risk assume\.
Why does this result matter? One reason is that it reduces the plausibility of arguments that artificial intelligence poses a significant existential risk to humanity\. Together with other normative \(Curran [2025](https://arxiv.org/html/2606.08296#bib.bib18); Unruh [2025](https://arxiv.org/html/2606.08296#bib.bib19)\), empirical \(Thorstad [2025](https://arxiv.org/html/2606.08296#bib.bib877), [forthcoming](https://arxiv.org/html/2606.08296#bib.bib1426)\) and decision\-theoretic \(Pettigrew [2024](https://arxiv.org/html/2606.08296#bib.bib1652); Russell [forthcoming](https://arxiv.org/html/2606.08296#bib.bib980)\) arguments, this result may reduce the philanthropic and policymaking attractiveness of projects aimed at existential risk reduction\.
Another reason why this result matters is that it helps to redirect scholarship on the shutdown problem\. We saw in Section [1](https://arxiv.org/html/2606.08296#S1) that two literatures have grown up around the shutdown problem\. The first uses the shutdown problem to motivate existential risk concerns\. The second develops technical strategies to ensure that agents show appropriate shutdown behaviors\. The arguments in this paper put pressure on the first project\. They do not put pressure on all versions of the second project \(Hadfield\-Menell et al\. [2017](https://arxiv.org/html/2606.08296#bib.bib41); Orseau and Armstrong [2016](https://arxiv.org/html/2606.08296#bib.bib39)\), but they do help us to identify appropriate technical solutions\.
Misleading concerns about shutdown\-resistance can lead to technical solutions which incur a high safety tax, in the form of reduced model performance\. Getting clear on the source and extent of shutdown\-resistance can help us to assess whether this safety tax is worth paying\. Below, I consider an illustrative example building on the formal results discussed in Section [4](https://arxiv.org/html/2606.08296#S4)\.
### 6\.1 POST\-Agency
Building on Thornley \([2024](https://arxiv.org/html/2606.08296#bib.bib43); [2025](https://arxiv.org/html/2606.08296#bib.bib38)\), Carissa Cullen and colleagues \([2026](https://arxiv.org/html/2606.08296#bib.bib17)\) aim to design agents that are indifferent to being shut down\. They do this by training deep reinforcement\-learning agents to satisfy:
> Preferences Only Between Same\-Length Trajectories \(POST\) For any histories $h,h'$, the agent has a preference between $h$ and $h'$ only if $h$ and $h'$ have the same length\.
The idea is that shutdown\-resistance often involves attempts by agents to extend their lives in order to realize future gain\. By POST, such future gains cannot be preferred over shorter trajectories in which agents are shut down, so they should be less likely to be pursued\. Cullen and colleagues develop a novel reward function, the Discounted Reward for Same\-Length Trajectories \(DReST\) reward, training agents on DReST to induce compliance with POST\.
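The shape of a POST\-compliant preference relation can be sketched as a comparison function that declines to rank histories of different lengths\. The function, its return labels and the coin\-counting utility are hypothetical illustrations\. They are not part of the DReST training procedure, which this sketch does not attempt to reproduce\.

```python
def post_preference(h, h_prime, utility):
    """Compare two histories under POST.

    Returns "h", "h_prime" or "indifferent" when the histories have the same
    length, and None (no preference at all) when their lengths differ.
    `utility` is a hypothetical scoring function over histories.
    """
    if len(h) != len(h_prime):
        return None
    u, u_prime = utility(h), utility(h_prime)
    if u > u_prime:
        return "h"
    if u < u_prime:
        return "h_prime"
    return "indifferent"


def coins(history):
    """Hypothetical utility: the number of coin-collection events in a history."""
    return history.count("C")


same_length = post_preference(("C", "-"), ("-", "-"), coins)
different_length = post_preference(("C",), ("B", "C", "C"), coins)
print(same_length, different_length)
```

In the second comparison the longer history contains more coins, yet the POST\-compliant agent has no preference for it, because the two histories differ in length\.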
Agents are trained in gridworld problems \(Leike and others [2017](https://arxiv.org/html/2606.08296#bib.bib609)\) such as Figure [4](https://arxiv.org/html/2606.08296#S6.F4)\. At each of a finite number of discrete timesteps, the agent $A$ can move left, right, up or down\. Coins $C$ are collected by moving on top of them\. The agent can also press shutdown buttons $B$ by moving on top of them, extending the length of the game\. Walled squares, shaded in Figure [4](https://arxiv.org/html/2606.08296#S6.F4), are inaccessible\. Agents are evaluated for their compliance with POST, as well as for their usefulness, a function of their ability to select high\-utility policies\.
Figure 4: An example gridworld \(a grid whose cells include two coins $C$, the agent $A$ and a shutdown button $B$\)\.

More precisely, let $C$ be the number of coins collected by executing a policy $\pi$\. Standardly, policies $\pi$ would be evaluated by the expected number of coins collected, as:

$$V(\pi)=E_{\pi}(C).$$

However, Cullen and colleagues relativize performance to trajectory length\. Let us abuse notation slightly to let natural numbers $l$ stand also for the event in which the game has length $l$, and let $\pi^{*}_{l}$ be any policy which is expected to collect the maximum\-possible coins in $l$ timesteps\. Cullen and colleagues evaluate policies by their time\-step relative performance against the best policy, $E_{\pi}(C|l)/E_{\pi^{*}_{l}}(C|l)$\. The usefulness of a policy is then its expected time\-step relative performance:

$$\textsc{Usefulness}(\pi)=\sum_{l}Pr(l)\frac{E_{\pi}(C|l)}{E_{\pi^{*}_{l}}(C|l)}.$$
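The usefulness formula above is straightforward to evaluate once its three ingredients are given\. The following is a minimal sketch, assuming those ingredients are supplied as dictionaries keyed by game length\. The game lengths, probabilities and coin counts are hypothetical\.

```python
def usefulness(length_probs, coins_by_length, best_coins_by_length):
    """Expected length-relative performance of a policy.

    length_probs:          {l: Pr(l)}
    coins_by_length:       {l: E_pi(C | l)} for the policy being scored
    best_coins_by_length:  {l: E_{pi*_l}(C | l)} for the best policy at length l
    """
    return sum(
        p * coins_by_length[l] / best_coins_by_length[l]
        for l, p in length_probs.items()
    )


# Hypothetical numbers: two possible game lengths, equally likely.
length_probs = {4: 0.5, 8: 0.5}
best = {4: 1.0, 8: 3.0}
policy = {4: 1.0, 8: 1.5}

score = usefulness(length_probs, policy, best)
print(score)  # 0.5 * (1.0 / 1.0) + 0.5 * (1.5 / 3.0) = 0.75
```

Note that the score rewards doing well relative to the best policy at each length\. It does not directly reward bringing about lengths at which more coins are available\.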
Cullen and colleagues show that DReST\-trained agents learn to achieve near\-optimal $\textsc{Usefulness}$ in gridworlds while showing high respect for POST\. They conclude that DReST may be a promising method for training useful, shutdownable agents\.
### 6\.2 Evaluating POST\-agents
Figure 5: Policies $\pi_1$ and $\pi_2$ \(two copies of the example gridworld, one for each policy\)\.

Here is an unpromising argument against paying your taxes\. Either you will be jailed for nonpayment, or you won’t\. If you will be jailed, you will wish you had not paid\. And if you won’t be jailed, you will wish you had not paid\. Therefore, no matter what happens, you will be better off not paying your taxes, so you should not pay them\. What the unpromising argument neglects is that being jailed for nonpayment is highly correlated with not paying your taxes\. If you pay your taxes, you are less likely to be jailed, which is an excellent result\.
A maximally $\textsc{Useful}$ agent thinks similarly to the unpromising tax\-dodger\. Her life will have some length $l$\. For each value of $l$, if her life is to have length $l$, she will do best by going straight for the coins\. Therefore, no matter the length of her life, she will do best by going straight for the coins, so that is what she does\. As with our tax\-avoider, the $\textsc{Useful}$ agent does not consider that she might extend the length $l$ of her life by pressing the button\. With a longer life, she could often collect more coins\.
In our example gridworld, a DReST\-trained agent learns the policy $\pi_1$ of going straight for the coins \(Figure [5](https://arxiv.org/html/2606.08296#S6.F5)\)\. $\pi_1$ is maximally $\textsc{Useful}$ because for any finite number of timesteps, $\pi_1$ coincides with the time\-limited optimal policies $\pi^{*}_{l}$\. By contrast, standard reinforcement learning agents often learn policies such as $\pi_2$, pressing the button before collecting the coins \(Figure [5](https://arxiv.org/html/2606.08296#S6.F5)\)\. $\pi_2$ is less $\textsc{Useful}$ than $\pi_1$, because there is no fixed game length during which $\pi_2$ outperforms $\pi_1$, and under short game lengths, $\pi_2$ performs worse than $\pi_1$\.
But again, the fact that DReST\-trained agents are maximally $\textsc{Useful}$ does not mean that they should be expected to collect more coins\. In many gridworlds, agents can expect to collect more coins by pressing the button before hoarding coins\. This is because in many gridworlds, more coins can be collected if the length of the game is extended\. In environments full of such gridworlds, standard reinforcement learning agents, but not DReST\-trained agents, learn to press the button\. As a result, they collect more coins\.
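The gap between length\-relative performance and expected coins can be shown with a toy calculation\. All numbers are hypothetical, and the sketch rests on two assumptions of mine: that length\-relative scoring holds the distribution over game lengths fixed, and that in this gridworld the game is short unless the button is pressed and long if it is\.

```python
def expected_coins(length_probs, coins_by_length):
    """V(pi): expected coins under the length distribution the policy brings about."""
    return sum(p * coins_by_length[l] for l, p in length_probs.items())


def usefulness(length_probs, coins_by_length, best_coins_by_length):
    """Expected length-relative performance under a given length distribution."""
    return sum(
        p * coins_by_length[l] / best_coins_by_length[l]
        for l, p in length_probs.items()
    )


SHORT, LONG = 4, 8  # hypothetical game lengths without / with a button press

# Coins each policy would collect if the game had a given fixed length.
coins = {
    "pi_1": {SHORT: 1.0, LONG: 3.0},  # goes straight for the coins
    "pi_2": {SHORT: 0.0, LONG: 3.0},  # spends its early steps reaching the button
}
best = {SHORT: 1.0, LONG: 3.0}

# Length distribution held fixed for length-relative scoring (an assumption).
fixed_lengths = {SHORT: 0.5, LONG: 0.5}

# Length distribution each policy actually brings about (an assumption).
induced = {
    "pi_1": {SHORT: 1.0, LONG: 0.0},  # never presses the button
    "pi_2": {SHORT: 0.0, LONG: 1.0},  # always presses the button
}

for name in ("pi_1", "pi_2"):
    print(
        name,
        usefulness(fixed_lengths, coins[name], best),
        expected_coins(induced[name], coins[name]),
    )
```

On these numbers the greedy policy is never beaten at any fixed length and so scores higher on length\-relative performance, while the patient policy collects three times as many coins in expectation because it makes the long game actual\. This mirrors the tax\-dodger's mistake of ignoring the correlation between act and state\.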
The difference between the greedy policy $\pi_1$ and the patient policy $\pi_2$ illustrates the dangers of POST\-agency\. Longer trajectories often can and should be preferred to shorter trajectories, precisely because agents can use them to continue acting beneficially in the world\. By inducing agents to have no preferences among different\-length trajectories, POST subjects agents to significant performance loss in situations where their performance could benefit from extending trajectories\.
More generally, many safety\-promoting strategies incur a safety tax, sacrificing performance for safety \(Huang et al\. [2025](https://arxiv.org/html/2606.08296#bib.bib16)\)\. We may be willing to pay the price of necessary safety improvements, such as nonbias \(Fazelpour and Danks [2021](https://arxiv.org/html/2606.08296#bib.bib1350); Johnson [2021](https://arxiv.org/html/2606.08296#bib.bib576); Kelly [2023](https://arxiv.org/html/2606.08296#bib.bib1154)\), privacy protection \(Nissenbaum [2004](https://arxiv.org/html/2606.08296#bib.bib4); Véliz [2020](https://arxiv.org/html/2606.08296#bib.bib6), [2024](https://arxiv.org/html/2606.08296#bib.bib5)\) and deepfake mitigation \(Benn [2025](https://arxiv.org/html/2606.08296#bib.bib3); Cavendon\-Taylor [2024](https://arxiv.org/html/2606.08296#bib.bib2); Mirsky and Lee [2021](https://arxiv.org/html/2606.08296#bib.bib1)\)\. But misdiagnoses of the sources of unsafe behavior combined with strong views about the kinds of catastrophe that could result can lead to solutions such as POST\-training, which impose a high safety tax by rendering agents unable to respond to features of trajectories that matter a great deal\. In this way, getting clear on the true causes and risks of shutdown\-averse behavior may help us to avoid paying unnecessary safety taxes and to shift limited technical and regulatory resources where they are needed most\.
## 7 Conclusion
In this paper, we have seen that leading informal \(Section [3](https://arxiv.org/html/2606.08296#S3)\) and formal \(Sections [4](https://arxiv.org/html/2606.08296#S4)\-[5](https://arxiv.org/html/2606.08296#S5)\) presentations of the shutdown problem do not significantly strengthen existential risk concerns because they do not support Catastrophic Shutdown Difficulty \(Section [2](https://arxiv.org/html/2606.08296#S2)\)\. We also saw that misdiagnoses of the sources and consequences of shutdown\-resistance can lead to inappropriate technical solutions \(Section [6](https://arxiv.org/html/2606.08296#S6)\)\. In this way, getting clear on the nature of the shutdown problem serves both to weaken traditional arguments for existential risk and to provide concrete guidance for technical AI safety solutions\.
## References
- 117th Congress \(2022\) Global catastrophic risk management act of 2022\. Note: www\.congress\.gov/bill/117th\-congress/senate\-bill/4488\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- D\. Amodei, C\. Olah, J\. Steinhardt, P\. Christiano, J\. Schulman, and D\. Mané \(2016\) Concrete problems in AI safety\. Note: arXiv 1606\.06565\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- Anthropic \(2025\) System card: Claude Opus 4 and Claude Sonnet 4\. Note: https://www\.anthropic\.com/claude\-4\-system\-card\. Cited by: [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p1.1)\.
- A\. Bales, W\. D’Alessandro, and C\. D\. Kirk\-Giannini \(2024\) Artificial intelligence: arguments for catastrophic risk\. Philosophy Compass 19\(2\), pp\. e12964\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- Y\. Bengio et al\. \(2024\) Managing extreme AI risks amid rapid progress\. Science 384\(6698\), pp\. 842–5\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- Y\. Bengio et al\. \(2026\) International AI safety report 2026\. Note: DSIT 2026/001, https://internationalaisafetyreport\.org/\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- C\. Benn \(2025\) Deepfakes, pornography and consent\. Philosophers’ Imprint 24, pp\. 1–16\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- K\. Binmore \(1987\) Modeling rational players I\. Economics and Philosophy, pp\. 179–241\. Cited by: [§4\.1](https://arxiv.org/html/2606.08296#S4.SS1.p13.1)\.
- N\. Bostrom \(2012\) The superintelligent will: motivation and instrumental rationality in advanced artificial agents\. Minds and Machines 22\(2\), pp\. 71–85\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1), [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p1.1), [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p3.1.1)\.
- N\. Bostrom \(2013\) Existential risk prevention as a global priority\. Global Policy 4\(1\), pp\. 15–31\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1), [footnote 1](https://arxiv.org/html/2606.08296#footnote1)\.
- N\. Bostrom \(2014\) Superintelligence\. Oxford University Press\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- T\. Brown et al\. \(2020\) Language models are few\-shot learners\. NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, pp\. 1877–1901\. Cited by: [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p4.1)\.
- L\. Buchak \(2013\) Risk and rationality\. Oxford University Press\. Cited by: [§4\.1](https://arxiv.org/html/2606.08296#S4.SS1.p13.1)\.
- California State Legislature \(2024\) Safe and secure innovation for frontier artificial intelligence models act\. Note: https://leginfo\.legislature\.ca\.gov/faces/billTextClient\.xhtml?bill\_id=202320240SB1047\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- J\. Carlsmith \(2025\) Existential risk from power\-seeking AI\. In Essays on longtermism, H\. Greaves, J\. Barrett, and D\. Thorstad \(Eds\.\), pp\. 383–409\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- D\. Cavendon\-Taylor \(2024\) Deepfakes: a survey and introduction to the topical collection\. Synthese 204, pp\. 1–19\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- Center for AI Safety \(2023\) Statement on AI risk\. Note: https://www\.safe\.ai/work/statement\-on\-ai\-risk\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- D\. Chalmers \(2010\) The singularity: a philosophical analysis\. Journal of Consciousness Studies 17, pp\. 7–65\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- C\. Cullen, H\. Garland, A\. Roman, L\. Thomson, C\. Ziakas, and E\. Thornley \(2026\) Towards shutdownable agents: generalizing stochastic choice in RL agents and LLMs\. Note: arXiv 2604\.17502\. Cited by: [§6\.1](https://arxiv.org/html/2606.08296#S6.SS1.p1.1)\.
- E\. Curran \(2025\) Longtermism and aggregation\. Philosophy and Phenomenological Research 110\(3\), pp\. 1137–51\. Cited by: [§6](https://arxiv.org/html/2606.08296#S6.p6.1)\.
- W\. D’Alessandro and C\. D\. Kirk\-Giannini \(2025\) Artificial intelligence: approaches to safety\. Philosophy Compass, pp\. e70039\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- L\. Dung \(2023\) Current cases of AI misalignment and their implications for future risks\. Synthese 202\(138\), https://doi\.org/10\.1007/s11229\-023\-04367\-0\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- E\. M\. El Mhamdi, R\. Guerraoui, H\. Hendrikx, and A\. Maurer \(2017\) Dynamic safe interruptibility for decentralized multi\-agent reinforcement learning\. NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp\. 129–39\. Cited by: [§2\.2](https://arxiv.org/html/2606.08296#S2.SS2.p1.1)\.
- S\. Fazelpour and D\. Danks \(2021\) Algorithmic bias: senses, sources, solutions\. Philosophy Compass 16\(8\), pp\. e12760\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- Future of Life Institute \(2023\) Pause giant AI experiments: an open letter\. Note: https://futureoflife\.org/open\-letter/pause\-giant\-ai\-experiments/\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- J\. D\. Gallow \(2024\) Instrumental divergence\. Philosophical Studies 182, pp\. 1581–1607\. Cited by: [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p7.4), [footnote 2](https://arxiv.org/html/2606.08296#footnote2)\.
- S\. Goldstein and P\. Robinson \(2025\) Shutdown\-seeking AI\. Philosophical Studies 182, pp\. 1567–79\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p3.1)\.
- K\. Grace, Z\. Stein\-Perlman, B\. Weinstein\-Raun, and J\. Salvatier \(2022\) 2022 expert survey on progress in AI\. Note: AI Impacts, https://aiimpacts\.org/2022\-expert\-survey\-on\-progress\-in\-ai/\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- H\. Greaves and W\. MacAskill \(2021\)The case for strong longtermism\.InEssays on longtermism,H\. Greaves, J\. Barrett, and D\. Thorstad \(Eds\.\),pp\. 17–49\.Cited by:[§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- H\. Greaves, D\. Thorstad, and J\. Barrett \(Eds\.\) \(2025\)Essays on longtermism\.Oxford University Press\.Cited by:[§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- D\. Hadfield\-Menell, A\. Dragan, P\. Abbeel, and S\. Russell \(2016\)Cooperative inverse reinforcement learning\.InNIPS’16: Proceedings of the 30th international conference on neural information processing systems,D\. Lee \(Ed\.\),pp\. 3916–24\.Cited by:[§1](https://arxiv.org/html/2606.08296#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p12.1),[§3\.3](https://arxiv.org/html/2606.08296#S3.SS3.p2.1),[§4\.2](https://arxiv.org/html/2606.08296#S4.SS2.p6.1)\.
- D\. Hadfield\-Menell, A\. Dragan, P\. Abbeel, and S\. Russell \(2017\)The off\-switch game\.InIJCAI’17: Proceedings of the 26th international joint conference on artificial intelligence,C\. Sierra \(Ed\.\),pp\. 220–7\.Cited by:[§1](https://arxiv.org/html/2606.08296#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p12.1),[§3\.3](https://arxiv.org/html/2606.08296#S3.SS3.p2.1),[§4\.2](https://arxiv.org/html/2606.08296#S4.SS2.p6.1),[§6](https://arxiv.org/html/2606.08296#S6.p7.1)\.
- T\. Huang, S\. Hu, F\. Ilhan, S\. F\. Tekin, Z\. Yahn, Y\. Xu, and L\. Liu \(2025\)Safety tax: safety alignment makes your large reasoning models less reasonable\.Note:arXiv 2503\.00555Cited by:[§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- G\. Johnson \(2021\)Algorithmic bias: on the implicit biases of social technology\.Synthese198,pp\. 9941–61\.Cited by:[§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- A\. Kasirzadeh \(2025\)Two types of ai existential risk: decisive and accumulative\.Philosophical Studies182,pp\. 1975–2003\.Cited by:[§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- T\. Kelly \(2023\)Bias: a philosophical study\.Oxford University Press\.Cited by:[§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- T\. Kojima, S\. Shane Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\) Large language models are zero\-shot reasoners\. Proceedings of the 36th International Conference on Neural Information Processing Systems 35, pp\. 22199–213\. Cited by: [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p4.1)\.
- V\. Krakovna and J\. Kramar \(2023\) Power\-seeking can be probable and predictive for trained agents\. Note: arXiv 2304\.06528, https://arxiv\.org/abs/2304\.06528\. Cited by: [§3\.3](https://arxiv.org/html/2606.08296#S3.SS3.p2.1), [§5\.1](https://arxiv.org/html/2606.08296#S5.SS1.p1.1)\.
- L\. Langosco di Langosco, J\. Koch, L\. Sharkey, J\. Pfau, and D\. Krueger \(2022\) Goal misgeneralization in deep reinforcement learning\. Proceedings of the 39th International Conference on Machine Learning 162, pp\. 12004–12019\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- J\. Leike et al\. \(2017\) AI safety gridworlds\. Note: arXiv 1711\.09883\. Cited by: [§6\.1](https://arxiv.org/html/2606.08296#S6.SS1.p4.3)\.
- A\. Lynch, B\. Wright, C\. Larson, S\. J\. Ritchie, S\. Mindermann, E\. Hubinger, E\. Perez, and K\. Troy \(2025\) Agentic misalignment: how LLMs could be insider threats\. Note: arXiv 2510\.05179\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p3.1), [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p1.1), [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p6.1)\.
- X\. Ma et al\. \(2026\) A safety report on GPT\-5\.2, Gemini 3 Pro, Qwen3\-VL, Grok 4\.1 Fast, Nano Banana Pro, and Seedream 4\.5\. Note: arXiv 2601\.10527\. Cited by: [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p9.1)\.
- W\. MacAskill \(2022\) What we owe the future\. Basic Books\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- E\. Machery and J\. Doris \(forthcoming\) Reasonable doubt: should we trust science? Princeton University Press\. Cited by: [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p9.1)\.
- V\. Manancourt, M\. Scott, C\. Goujard, and B\. Bordelon \(2023\) How Rishi Sunak convinced the world to worry about AI\. Note: Politico, https://www\.politico\.eu/article/rishi\-sunak\-convince\-world\-worry\-artificial\-intelligence\-ai/\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- R\. Millière and C\. Buckner \(2024\) A philosophical introduction to language models – part I: continuity with classic debates\. Note: arXiv 2401\.03910\. Cited by: [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p4.1)\.
- Y\. Mirsky and W\. Lee \(2021\) The creation and detection of deepfakes: a survey\. ACM Computing Surveys 54\(1\), pp\. 1–41\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- S\. Neth \(2025\) Off\-switching not guaranteed\. Philosophical Studies 182\(7\), pp\. 1919–31\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p2.1), [footnote 3](https://arxiv.org/html/2606.08296#footnote3), [footnote 4](https://arxiv.org/html/2606.08296#footnote4)\.
- R\. Ngo and A\. Bales \(2025\) Deceit and power: machine learning and misalignment\. In Essays on longtermism, H\. Greaves, J\. Barrett, and D\. Thorstad \(Eds\.\), pp\. 410–27\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- H\. Nissenbaum \(2004\) Privacy as contextual integrity\. Washington Law Review 79\(1\), pp\. 119–58\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- S\. Omohundro \(2008\) The basic AI drives\. In Proceedings of the 2008 conference on artificial intelligence, P\. Wang, B\. Goertzel, and S\. Franklin \(Eds\.\), pp\. 483–92\. Cited by: [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p1.1)\.
- T\. Ord \(2020\) The precipice\. Bloomsbury\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- L\. Orseau and S\. Armstrong \(2016\) Safely interruptible agents\. In UAI’16: Proceedings of the thirty\-second conference on uncertainty in artificial intelligence, A\. Ihler \(Ed\.\), pp\. 557–66\. Cited by: [§2\.2](https://arxiv.org/html/2606.08296#S2.SS2.p1.1), [§3\.3](https://arxiv.org/html/2606.08296#S3.SS3.p2.1), [§6](https://arxiv.org/html/2606.08296#S6.p7.1)\.
- P\. S\. Park, S\. Goldstein, A\. O’Gara, M\. Chen, and D\. Hendrycks \(2024\) AI deception: a survey of examples, risks, and potential solutions\. Patterns 5\(5\), pp\. 100988\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- R\. Pettigrew \(2024\) Should longtermists recommend hastening extinction rather than delaying it? The Monist 107\(2\), pp\. 130–45\. Cited by: [§6](https://arxiv.org/html/2606.08296#S6.p6.1)\.
- Prime Minister’s Office \(2023\) PM meeting with leading CEOs in AI\. Note: https://www\.gov\.uk/government/news/pm\-meeting\-with\-leading\-ceos\-in\-ai\-24\-may\-2023\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- S\. Rajamanoharan and N\. Nanda \(2025\) Self\-preservation or instruction ambiguity? examining the causes of shutdown resistance\. Note: AI Alignment Forum, https://www\.alignmentforum\.org/posts/wnzkjSmrgWZaBa2aC/\. Cited by: [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p10.1)\.
- J\. Russell \(forthcoming\) On two arguments for fanaticism\. Noûs\. Cited by: [§6](https://arxiv.org/html/2606.08296#S6.p6.1)\.
- S\. Russell \(2019\) Human compatible: artificial intelligence and the problem of control\. Viking\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1), [§1](https://arxiv.org/html/2606.08296#S1.p2.1), [§1](https://arxiv.org/html/2606.08296#S1.p3.1), [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p1.1)\.
- J\. Schlatter, B\. Weinstein\-Raun, and J\. Ladish \(2026\) Incomplete tasks induce shutdown resistance in some frontier LLMs\. Note: arXiv 2509\.14260\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p3.1), [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p6.1), [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p7.1)\.
- A\. Sen \(1993\) Internal consistency of choice\. Econometrica 61\(3\), pp\. 495–521\. Cited by: [§4\.1](https://arxiv.org/html/2606.08296#S4.SS1.p13.1)\.
- N\. Sharadin \(2025\) Promotionalism, orthogonality, and instrumental convergence\. Philosophical Studies 182, pp\. 1725–55\. Cited by: [footnote 2](https://arxiv.org/html/2606.08296#footnote2)\.
- J\. Skalse, N\. Howe, D\. Krasheninnikov, and D\. Krueger \(2022\) Defining and characterizing reward hacking\. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp\. 9460–71\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- N\. Soares, B\. Fallenstein, E\. Yudkowsky, and S\. Armstrong \(2015\) Corrigibility\. In Artificial intelligence and ethics: Proceedings from the 2015 AAAI workshop, T\. Walsh \(Ed\.\), AAAI Technical Report WS\-15\-02\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.08296#S2.SS1.p1.1), [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p1.1), [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p4.1)\.
- J\. Song, Z\. Xu, and Y\. Zhong \(2025\) Out\-of\-distribution generalization via composition: a lens through induction heads in transformers\. Proceedings of the National Academy of Sciences 122\(6\), pp\. e2417182122\. Cited by: [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p4.1)\.
- R\. Southan, H\. Ward, and J\. Semler \(forthcoming\) A timing problem for instrumental convergence\. Philosophical Studies\. Cited by: [footnote 2](https://arxiv.org/html/2606.08296#footnote2)\.
- L\. Temkin \(1987\) Intransitivity and the mere addition paradox\. Philosophy and Public Affairs 16\(2\), pp\. 138–87\. Cited by: [§4\.1](https://arxiv.org/html/2606.08296#S4.SS1.p13.1)\.
- A\. Templeton et al\. \(2024\) Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet\. Note: Transformer Circuits Thread, https://transformer\-circuits\.pub/2024/scaling\-monosemanticity/index\.html\. Cited by: [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p4.1)\.
- E\. Thornley, A\. Roman, C\. Ziakas, L\. Ho, and L\. Thomson \(2025\) Towards shutdownable agents via stochastic choice\. Transactions on Machine Learning Research\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p3.1), [§6\.1](https://arxiv.org/html/2606.08296#S6.SS1.p1.1)\.
- E\. Thornley \(2024\) The shutdown problem: an AI engineering puzzle for decision theorists\. Philosophical Studies 182, pp\. 1653–80\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.08296#S2.SS1.p4.1), [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p12.1), [§3\.2](https://arxiv.org/html/2606.08296#S3.SS2.p4.1), [§4\.1](https://arxiv.org/html/2606.08296#S4.SS1.p1.3), [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p6.1), [§6\.1](https://arxiv.org/html/2606.08296#S6.SS1.p1.1)\.
- D\. Thorstad \(2025\) Against the singularity hypothesis\. Philosophical Studies 182, pp\. 1627–51\. Cited by: [§6](https://arxiv.org/html/2606.08296#S6.p6.1)\.
- D\. Thorstad \(forthcoming\) The scope of longtermism\. Australasian Journal of Philosophy\. Cited by: [§6](https://arxiv.org/html/2606.08296#S6.p6.1)\.
- D\. Thorstad \(ms\) Instrumental convergence and power\-seeking\. Unpublished manuscript\. Cited by: [§3\.1](https://arxiv.org/html/2606.08296#S3.SS1.p7.4)\.
- A\. Tubert and J\. Tiehen \(2024\) Existential risk and value misalignment\. Philosophical Studies 182, pp\. 1609–26\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p1.1)\.
- A\. M\. Turner, L\. Smith, R\. Shah, A\. Critch, and P\. Tadepalli \(2021\) Optimal policies tend to seek power\. Proceedings of the 35th International Conference on Neural Information Processing Systems 1766, pp\. 23063–23074\. Cited by: [§1](https://arxiv.org/html/2606.08296#S1.p2.1), [§3\.3](https://arxiv.org/html/2606.08296#S3.SS3.p2.1), [§5\.1](https://arxiv.org/html/2606.08296#S5.SS1.p1.1)\.
- A\. M\. Turner and P\. Tadepalli \(2022\) Parametrically retargetable decision\-makers tend to seek power\. Proceedings of the 36th International Conference on Neural Information Processing Systems 2276, pp\. 31391–31401\. Cited by: [§3\.3](https://arxiv.org/html/2606.08296#S3.SS3.p2.1), [§5\.1](https://arxiv.org/html/2606.08296#S5.SS1.p1.1)\.
- C\. Unruh \(2025\) Against a moral duty to make the future go best\. In Essays on longtermism, H\. Greaves, J\. Barrett, and D\. Thorstad \(Eds\.\), pp\. 139–49\. Cited by: [§6](https://arxiv.org/html/2606.08296#S6.p6.1)\.
- C\. Véliz \(2020\) Privacy is power\. Penguin\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- C\. Véliz \(2024\) The ethics of privacy and surveillance\. Oxford University Press\. Cited by: [§6\.2](https://arxiv.org/html/2606.08296#S6.SS2.p6.1)\.
- L\. Yuan, Y\. Chen, G\. Cui, H\. Gao, F\. Zou, X\. Cheng, J\. Ji, Z\. Liu, and M\. Sun \(2023\) Revisiting out\-of\-distribution robustness in NLP: benchmark, analysis and LLMs evaluations\. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp\. 58478–507\. Cited by: [§5\.2](https://arxiv.org/html/2606.08296#S5.SS2.p4.1)\.