Solipsistic Superintelligence is Unlikely to be Cooperative
Summary
This paper argues that superintelligent AI systems designed under a solipsistic paradigm that treats the world as stationary will be self-undermining and uncooperative, leading to collective failures. The authors call for a new research paradigm that treats interdependence and cooperation as core design principles.
Cached at: 06/03/26, 09:43 AM
# Solipsistic Superintelligence is Unlikely to be Cooperative
Source: [https://arxiv.org/html/2606.03237](https://arxiv.org/html/2606.03237)
Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo
###### Abstract
AI’s central challenge is shifting from capability to coexistence\. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback\. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative\. Deploying AI systems induces endogenous non\-stationarity, resulting in a train–test–deploy gap where historical distributions diverge from the deployment context\. We refer to this as the self\-undermining property of unilateral optimization\. Closing this gap requires AI that participates in cooperation: the equilibrium\-selection process through which multiple actors navigate their interdependence\. We call for a non\-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve\. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build\.
## 1 Introduction
On a Friday evening in 2027, three competing AI reservation systems in San Francisco calculate optimal release times and learn to make phantom bookings so as to maximize confirmed seats for their users\. Restaurant AIs respond by overbooking as pricing algorithms adjust to the perceived demand\. As the evening progresses, this results in empty tables in fully\-booked restaurants, surge prices for nonexistent availability, and hundreds of people unable to dine\. Each AI system executes flawlessly against its objective, but the eventual outcome is a system failure\. This represents a collective\-action problem: when many agents act in their own rational interest within an environment with shared resources, the cumulative effect can be the degradation of the very environment they depend on \(Hardin, [1968](https://arxiv.org/html/2606.03237#bib.bib96); Ostrom, [1990](https://arxiv.org/html/2606.03237#bib.bib62)\)\.
Next, consider a scenario in which AI diagnostic systems become standard in radiology\. Junior radiologists who are now trained with AI annotations develop pattern recognition shaped by the AI system\. At the same time, senior radiologists experience degradation of unassisted skills from disuse and catch fewer of the errors the AI also misses\. The feedback loop closes with physicians confirming AI suggestions and the AI learning from these confirmations\. This leads to gradual atrophy in the capacity for independent human judgment and a narrowing of diagnostic diversity\. Humans without the opportunity to practice eventually lose the skills needed to operate on their own \(Kulveit et al\., [2025a](https://arxiv.org/html/2606.03237#bib.bib86)\)\.
These examples illustrate a fundamental principle: intelligence deployed among other intelligent actors transforms the environment it was designed to navigate \(Schelling, [1960](https://arxiv.org/html/2606.03237#bib.bib64); Axelrod, [1984](https://arxiv.org/html/2606.03237#bib.bib63)\)\. For any AI operating in such environments, unilateral optimization is self\-undermining\. The more aggressively it exploits historical regularities, the faster other actors adapt in ways that render those regularities obsolete\. In the examples above, the deployed AI systems did not fail at their tasks, yet they still produced collective failure\. Such dynamics are widely understood to be important in economics and game theory \(Parkes and Wellman, [2015](https://arxiv.org/html/2606.03237#bib.bib41); Hammond et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib81)\); however, the dominant AI research paradigm seems to proceed as if they were edge cases rather than central challenges\.


Figure 1: Left: Contrasts a solipsistic design approach with non\-solipsistic design principles for cooperation\. In the solipsistic approach, AI systems are trained and evaluated against a fixed, exogenous world, so deployment is treated as inserting a unilateral optimizer into a stationary environment\. The train–test–deploy gap arises when this assumption meets a multi\-actor world where entities best respond to AI’s actions and induce endogenous non\-stationarities\. Cooperation is not a task to be solved in this setting but an equilibrium\-selection process\. Unilateral optimization may remain task\-successful while becoming self\-undermining → unlikely to sustain cooperation\. The non\-solipsistic design principle aims to reduce this gap\. Right: Summarizes the corresponding shift across eight dimensions\.
AI’s binding constraint is shifting from capability, that is, solving problems \(performing tasks\), to coexistence\. The dominant research paradigm adopts what we term a solipsistic approach to AI design, anchored in three implicit assumptions: the environment is exogenous to the agent’s policy, the data distribution is stationary from training to deployment, and other agents are absorbed into the state space to be predicted rather than treated as strategic actors whose responses reshape the game \(Legg and Hutter, [2007](https://arxiv.org/html/2606.03237#bib.bib6); Ouyang et al\., [2022](https://arxiv.org/html/2606.03237#bib.bib154)\)\. This conception underlies much of contemporary AI development, from large language model pretraining to reinforcement learning\. The core element of this paradigm is the development pipeline that includes pretraining on static corpora, post\-training against frozen reward models, and hill\-climbing \(aka benchmaxxing\) on fixed evaluation suites\. Each stage treats the external world as a stationary distribution, and the measure of progress is performance on targets that do not respond\. A benchmark \(i\.e\. a static reward model or a fixed held\-out test set\) is not an adaptive counterparty\. It does not respond when the system improves, nor does it strategize or change its own behavior and preferences as a function of the system’s behavior\. This methodological commitment, we argue, represents a category error\. As capable systems are deployed among adaptive agents, the world pushes back: humans adapt their behavior \(Bowles, [1998](https://arxiv.org/html/2606.03237#bib.bib57)\), institutions revise rules \(Ostrom, [1990](https://arxiv.org/html/2606.03237#bib.bib62)\), and AI counterparties also adapt \(Perdomo et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib35)\)\. The result is a divergence between historical and deployment performance: the train\-test\-deploy gap\.
By solipsistic superintelligence, we refer to the product of this paradigm pushed to its limit\. It represents an extremely capable AI \(perhaps one that “solves all stationary tasks”\) built on assumptions that held historically up to the point of deployment but no longer hold afterwards\. A limiting case of a solipsistic superintelligence would be an AI so powerful that it can anticipate the dynamics of all sequences, except those that encode the response to its own deployment\. When the outcomes of an AI system’s actions at deployment depend on the joint behavior of multiple adaptive agents, good performance ceases to be an optimization output of any single policy in isolation and becomes an equilibrium property of the coupled system \(see Figure [1](https://arxiv.org/html/2606.03237#S1.F1) for contrast\)\.
Multiple different equilibria are generally possible, and they may differ sharply in welfare and distributional consequences\. We use the term cooperation to refer to the negotiation process by which a society coordinates to select beneficial equilibria and avoid harmful ones\. Note that cooperation \(at the level of society\) could include competition between individuals \(e\.g\. if such competition enables selection of good equilibria\)\. Note also that social dynamics need not progress all the way to equilibrium, which is itself a moving target\. What matters for cooperation, by this definition, is the process by which equilibria are selected and re\-selected, not convergence to any particular one\. Cooperation in this sense is a structural feature of how multiple intelligences navigate their interdependence\.
Central Thesis: Cooperation is not an additional capability to be scaled or a task to solve, but an equilibrium property that emerges from multiple intelligences navigating their irreducible interdependence\. The solipsistic paradigm fails to account for the structure that makes cooperation possible or fragile\. A solipsistic superintelligence, therefore, is unlikely to be cooperative\.
AI subdisciplines concerned with cooperation have long considered the environment’s capacity to “push back” in response to deployed technologies \(Dafoe et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib17); Askell et al\., [2019](https://arxiv.org/html/2606.03237#bib.bib18); Conitzer and Oesterheld, [2023](https://arxiv.org/html/2606.03237#bib.bib36); Leibo et al\., [2021](https://arxiv.org/html/2606.03237#bib.bib34); Hammond et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib81)\)\. However, these insights remain peripheral to the central scaling pathway of training solitary foundation models\. We characterize the structural conditions that make cooperation the binding constraint, and argue that the dominant methodology is unlikely to satisfy them\. The same optimization pressures that drive capabilities can destabilize existing equilibria, producing arms races, antisocial autocurricula, and brittle societies \(Leibo et al\., [2019](https://arxiv.org/html/2606.03237#bib.bib87); Tomašev et al\., [2026](https://arxiv.org/html/2606.03237#bib.bib13)\)\.
The classical AI safety literature has focused on the misaligned optimizer, e\.g\. the paperclip maximizer that pursues its objective regardless of human values \(Bostrom, [2012](https://arxiv.org/html/2606.03237#bib.bib72); Omohundro, [2008](https://arxiv.org/html/2606.03237#bib.bib71)\)\. That concern has merit and has shaped where the field invests much of its effort \(Ji et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib201); Ngo et al\., [2024](https://arxiv.org/html/2606.03237#bib.bib20)\)\. But it overlooks a critical failure mode that is already widespread\. A system can be perfectly aligned with its specification, values included, and still make things worse once it acts among other adaptive systems\. Recommendation algorithms optimized for engagement produce polarization as a byproduct of success \(Germano et al\., [2026](https://arxiv.org/html/2606.03237#bib.bib69); Milli et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib148)\), pricing algorithms interacting in markets learn supracompetitive prices without explicit communication \(Calvano et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib8)\), and automated order flow and liquidity provision interact to produce the kind of instability seen in the Flash Crash \(Kirilenko et al\., [2017](https://arxiv.org/html/2606.03237#bib.bib9)\)\. Each of these follows from treating a multi\-actor game as though it were a unilateral optimization problem\.
Scope\. Superintelligence is a polysemous term with various definitions encompassing systems exceeding human cognition across domains, universal intelligence, and transformative economic capability\. Our paper sets aside these definitions pertaining to capability thresholds and instead focuses on the methodological assumptions of environmental exogeneity, objective stationarity, and singleton framing\. Any system \(foundation model, autonomous agent, or AGI\) will still inherit these assumptions if built under solipsistic commitments\. While today’s systems already exhibit the dynamics we describe, at the level of superintelligence our position contests an implicit bet in the dominant methodology that scaling capability will eventually deliver cooperative outcomes the way it has delivered gains in reasoning or coding\. While we expect solipsistic methods to remain effective in narrow domains, our paper targets sociotechnical settings where advanced AI deployment will be heavily exposed to response dynamics \(“push back”\)\.
## 2 From Capability to Coexistence
This section articulates why cooperation is necessary for beneficial coexistence of AI among many adaptive actors\.
### 2\.1 Why cooperation, not alignment?
The alignment research program has produced valuable insights such as the recognition that capable systems may pursue objectives in unintended ways \(Ngo et al\., [2024](https://arxiv.org/html/2606.03237#bib.bib20)\), that reward signals can be gamed \(Kenton et al\., [2021](https://arxiv.org/html/2606.03237#bib.bib19)\), and that human preferences are difficult to specify and easy to satisfy superficially \(Kaufmann et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib173)\)\. These contributions account for the world in which the hard problem is getting the objective right, and the expectation is that once the objective \(or a rich combination of objectives as in Sorensen et al\., [2024](https://arxiv.org/html/2606.03237#bib.bib166)\) is specified, optimization will deliver\. We argue that this framing is unhelpful \(Leibo et al\., [2025b](https://arxiv.org/html/2606.03237#bib.bib158)\)\. Providing a clear specification of desired individual behavior is all well and good, but an individual may be perfectly aligned with such a specification and still participate in collective dynamics that produce harm, instability, or illegitimacy, even without overt misuse \(Edelman et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib82)\)\. Indeed, Evans et al\. \([2026](https://arxiv.org/html/2606.03237#bib.bib95)\) argue that future intelligence explosions will not arise from isolated, monolithic oracles, but from complex, multi\-agent social systems, and that consequently the field must transition from dyadic, individual alignment to institutional alignment to effectively govern these interacting ecologies\.
When the environment is constituted by other optimizers who respond to what a system does, the landscape itself shifts with every move, and “getting the objective right” stops being what separates success from failure\. Cooperation accounts for what happens when multiple capable systems interact in a shared environment \(Schelling, [1960](https://arxiv.org/html/2606.03237#bib.bib64); Ostrom, [1990](https://arxiv.org/html/2606.03237#bib.bib62)\)\. Strategic interaction admits equilibria, often multiple, with no guarantee that decentralized choices of individuals will add up to group\-level wisdom \(Maskin, [2008](https://arxiv.org/html/2606.03237#bib.bib53); Myerson, [2008](https://arxiv.org/html/2606.03237#bib.bib54)\)\. The question, then, shifts from “what do humans want?” to “what arrangements are sustainable when all parties \(human and artificial\) adapt in response to each other, and what processes select beneficial arrangements from the many possibilities?”
Key Claim 1: In a world with many humans and many AIs, cooperation is neither optional nor an additional capability to be scaled but a necessary condition for sustained beneficial outcomes\.
### 2\.2 The stakes
Three structural features of deployment drive what cooperation must contend with: externalities and equilibrium shifts, mismatched timescales across participants, and constraints on legitimacy and agency\.
Systemic externalities and equilibrium shifts\. When optimizing entities or algorithms operate in a shared environment, their interactions produce effects that no single entity’s objective accounts for\. First, when models support decisions, the predictions themselves can influence the very outcomes they aim to predict, a phenomenon known as performative prediction \(Perdomo et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib35)\)\. For example, predicting election results might trigger an underdog effect that alters voter turnout, thereby shifting the actual distribution\. Second, when many agents act in their own interest, they can degrade shared environments, leading to a tragedy of the commons \(Hardin, [1968](https://arxiv.org/html/2606.03237#bib.bib96)\)\. If competing AIs aggressively consume a shared resource, such as reservation algorithms making phantom bookings to secure tables, they can cause collective failure \(Perolat et al\., [2017](https://arxiv.org/html/2606.03237#bib.bib210); Piatti et al\., [2024](https://arxiv.org/html/2606.03237#bib.bib211)\)\. The externalities arise because each system optimizes against an environment that other optimizers are simultaneously transforming\. Social equilibria can shift rapidly once thresholds are crossed \(Granovetter, [1978](https://arxiv.org/html/2606.03237#bib.bib60); Marwell and Oliver, [1993](https://arxiv.org/html/2606.03237#bib.bib160); Centola et al\., [2018](https://arxiv.org/html/2606.03237#bib.bib59)\), sometimes to a worse state than before\. Intelligence does not dissolve this problem\. In fact, higher capability may exacerbate it, since more effective exploitation of opportunities can accelerate the dynamics destabilizing existing arrangements \(Duéñez\-Guzmán et al\., [2023](https://arxiv.org/html/2606.03237#bib.bib159)\)\.
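The election example can be made concrete with a toy feedback loop\. The sketch below is illustrative only: the linear response model, the function names, and the numbers \(`baseline`, `sensitivity`\) are assumptions introduced here, not taken from the paper or from Perdomo et al\. It shows a forecast fitted to history missing the outcome once published, and repeated retraining on the induced outcome settling at a self\-consistent forecast\.

```python
def realized_share(forecast: float, baseline: float = 0.60, sensitivity: float = 0.5) -> float:
    """Hypothetical response model: the outcome moves against the published forecast.

    baseline: vote share observed when a neutral forecast (0.5) is published.
    sensitivity: assumed strength of the underdog effect.
    """
    return baseline - sensitivity * (forecast - 0.5)


def retrain_until_stable(initial_forecast: float, rounds: int = 50) -> float:
    """Repeatedly refit the forecast to the outcome that the previous forecast induced."""
    forecast = initial_forecast
    for _ in range(rounds):
        forecast = realized_share(forecast)
    return forecast


# A model fitted to pre-deployment history predicts the historical share.
historical_forecast = 0.60
# Publishing that forecast changes the outcome it was meant to predict.
gap = historical_forecast - realized_share(historical_forecast)
# Retraining on post-deployment data converges to a self-consistent forecast.
stable_forecast = retrain_until_stable(historical_forecast)

print(f"error of the historical forecast after publication: {gap:.3f}")
print(f"self-consistent forecast after retraining: {stable_forecast:.4f}")
```

In this toy model the loop converges only because the assumed sensitivity is below 1; a stronger response would make successive forecasts oscillate with growing amplitude instead of settling\.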
Temporal asymmetry\. AI systems can adapt at different timescales than the entities among which they are deployed\. Model updates and policy changes that might take an organization months can be implemented in days with AI\. Humans and institutions move far more slowly, since new skills, revised routines, legislation, and shifting cultural norms unfold over weeks, months, or years \(Young, [2015](https://arxiv.org/html/2606.03237#bib.bib61)\)\.
Legitimacy\. Coordination mechanisms often require legitimacy to function\. Markets work when participants accept market allocations as broadly fair \(Sondak and Tyler, [2007](https://arxiv.org/html/2606.03237#bib.bib214)\), and legal systems work when their procedures are seen as legitimate \(Hadfield and Weingast, [2014](https://arxiv.org/html/2606.03237#bib.bib161)\)\. Cooperation requires legitimacy because an arrangement remains effective at coordination only while its actors keep accepting it\. Once it is seen as illegitimate they resist, circumvent, or withdraw until the coordination unravels\. Legitimacy erodes when the channels of participation and public deliberation on consequential decisions are bypassed or rendered ineffective \(Pasquale, [2015](https://arxiv.org/html/2606.03237#bib.bib175); Crawford and Schultz, [2014](https://arxiv.org/html/2606.03237#bib.bib176)\)\. This leaves affected parties with no meaningful control over the choices that shape them \(Santoni de Sio and van den Hoven, [2018](https://arxiv.org/html/2606.03237#bib.bib10)\) and undermines cooperation\.
## 3 The Solipsistic Trap
Why can’t the solipsistic approach satisfy the requirements outlined in Section [2](https://arxiv.org/html/2606.03237#S2)?
### 3\.1 Implicit assumptions of solipsistic AI
The dominant methodology in machine learning rests on assumptions that are rarely stated\. Three of them deserve attention: \(i\) Exogeneity treats the data\-generating process as independent of the learned policy\. The environment is modeled as a generator of observations indifferent to what the agent does\. \(ii\) Stationarity assumes the deployment distribution matches training and evaluation\. Distribution shift, when acknowledged, is treated as a technical problem for robustness techniques to solve rather than a core feature of the deployment landscape arising from reactions of other intelligent entities\. \(iii\) Singleton framing conceives the system as a monolithic optimizer acting on the world\. Other agents, when modeled at all, are absorbed into the environment as objects to predict and patterns to exploit rather than strategic actors that respond and reshape the landscape\.
### 3\.2 Formalism: from MDPs to Markov games
Sequential decision\-making under uncertainty is commonly formalized as a Markov Decision Process \(MDP\) $(S, A, P, R, \gamma)$, with state space $S$, action space $A$, transition dynamics $P$, reward $R$, and discount factor $\gamma$\. A policy $\pi$ specifies how the agent acts\. Critically, $P$ and $R$ are fixed and do not depend on $\pi$\.
Deployment breaks this assumption\. Once a policy is deployed, other actors \(humans, institutions, algorithms\) observe its behavior and adapt\. The transition dynamics become policy\-dependent, with $P_{\pi}(s^{\prime} \mid s, a) \neq P(s^{\prime} \mid s, a)$\. While the physical laws $P$ may remain constant, the aggregate response of other agents shifts the state evolution observed by the deployed system\. The environment ceases to be exogenous and becomes a Markov game \(Shapley, [1953](https://arxiv.org/html/2606.03237#bib.bib73); Littman, [1994](https://arxiv.org/html/2606.03237#bib.bib74)\), a multi\-player game with strategic counterparties whose policies co\-evolve with each other\.
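One way to make the policy dependence concrete is to write the kernel that the deployed agent experiences as a marginal over the other actors’ responses\. The notation below is an illustrative assumption, not the paper’s own formalism: $T$ is a hypothetical joint transition kernel of the underlying Markov game, $a_{-i}$ is the joint action of the other actors, and $\pi_{-i}(\cdot \mid s;\, \pi)$ is their policy after adapting to the deployed $\pi$\.

$$
P_{\pi}(s^{\prime} \mid s, a) \;=\; \sum_{a_{-i}} \pi_{-i}(a_{-i} \mid s;\, \pi)\; T(s^{\prime} \mid s, a, a_{-i})
$$

Under this reading, the MDP assumption corresponds to the special case in which $\pi_{-i}$ does not depend on $\pi$, so that $P_{\pi} = P$ for every policy\.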
###### Definition 3\.1\.
A learning problem exhibits endogenous non\-stationarity when the deployment of policy $\pi$ induces changes in the transition dynamics $P$ or reward proxy $R$ through response adaptations of other agents\.
###### Definition 3\.2\.
The train\-test\-deploy gap is the systematic divergence between performance under exogenous historical data, $J_{\text{train}}$, and performance under endogenous data produced by responses to the deployed policy, $J_{\text{deploy}}$\.
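Definition 3\.2 can be written compactly as follows\. This is a sketch in notation introduced here for illustration: the signed gap $\Delta$ and the policy\-indexed reward $R_{\pi}$ are assumptions, not symbols from the source, and each expectation is over trajectories generated by $\pi$ under the indicated dynamics\.

$$
J_{\text{train}}(\pi) = \mathbb{E}_{P,\,\pi}\Big[\sum_{t \ge 0} \gamma^{t} R(s_t, a_t)\Big], \qquad
J_{\text{deploy}}(\pi) = \mathbb{E}_{P_{\pi},\,\pi}\Big[\sum_{t \ge 0} \gamma^{t} R_{\pi}(s_t, a_t)\Big], \qquad
\Delta(\pi) = J_{\text{train}}(\pi) - J_{\text{deploy}}(\pi)
$$

If deployment induces no response, then $P_{\pi} = P$ and $R_{\pi} = R$, and $\Delta(\pi) = 0$ in this notation\.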
A further property characterizes optimization in strategic environments\. Let $\pi^{*}_{\text{exploit}}$ denote a policy that aggressively exploits regularities in historical data\. Such exploitation creates incentives for other agents to adapt in ways that invalidate those regularities\. We call this the self\-undermining property: the more aggressively a unilateral optimizer exploits historical patterns, the faster it induces the adaptations that render those patterns obsolete\. The effect varies with capability\. For weakly capable systems, the adaptations induced by exploitation may be small, and the performance gap could stay within bounds\. For more capable systems, the picture changes as deeper exploitation creates stronger incentives to adapt, and the adaptations themselves arrive as sharper regime shifts\. We provide a formal discussion in Appendix [A](https://arxiv.org/html/2606.03237#A1)\.
### 3\.3 Three channels of structured adaptation
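The self\-undermining property can be illustrated with a deliberately simple toy model\. Everything in the sketch below is an assumption made for illustration \(the multiplicative decay rule, the parameter `adapt_rate`, and the function names\); it is not the formal treatment the paper defers to its appendix\. A policy exploits a historical regularity with some intensity, and counterparties erode that regularity in proportion to how hard it is exploited\.

```python
def simulate(exploit: float, adapt_rate: float = 0.5, steps: int = 20) -> list[float]:
    """Per-step payoff of a unilateral optimizer in a toy adaptive environment.

    exploit: fraction (0..1] of the historical regularity the policy exploits.
    adapt_rate: hypothetical speed at which counterparties respond to exploitation.
    """
    regularity = 1.0  # normalized strength of the historical pattern
    payoffs = []
    for _ in range(steps):
        payoffs.append(exploit * regularity)
        # Counterparties adapt in proportion to how hard they are being exploited.
        regularity *= 1.0 - adapt_rate * exploit
    return payoffs


def half_life(exploit: float, adapt_rate: float = 0.5) -> int:
    """Number of steps until the regularity has lost half of its strength."""
    if not 0.0 < exploit <= 1.0:
        raise ValueError("exploit must be in (0, 1]")
    regularity, steps = 1.0, 0
    while regularity > 0.5:
        regularity *= 1.0 - adapt_rate * exploit
        steps += 1
    return steps


cautious, aggressive = simulate(0.2), simulate(0.8)
print("first-step payoff:", cautious[0], "vs", aggressive[0])
print("payoff at step 5: ", round(cautious[5], 3), "vs", round(aggressive[5], 3))
print("half-life of the regularity:", half_life(0.2), "vs", half_life(0.8))
```

In this toy setting the aggressive policy earns more at first but destroys the pattern it relies on several times faster, so within a few steps it earns less per step than the cautious one\.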
The train\-test\-deploy gap arises through three channels, each representing a distinct class of best\-responding agents\.
Behavioral adaptation\. Humans alter their behavior in response to deployed systems \(Kulveit et al\., [2025a](https://arxiv.org/html/2606.03237#bib.bib86)\)\. Students may restructure learning strategies around AI tutors, shifting the learner distribution the system encounters\. Pilots who rely heavily on autopilot can experience degradation of flying skills \(Parasuraman and Riley, [1997](https://arxiv.org/html/2606.03237#bib.bib76)\), changing the capabilities of the human counterparty the automation must complement\. In all such scenarios, the system confronts a distribution shaped by responses to its own presence\.
Institutional adaptation\. Organizations follow a similar pattern\. When algorithmic screening enters hiring processes, the surrounding practices shift: candidates may adjust their resumes, recruiters may recalibrate their criteria, and HR departments may revise workflows to accommodate or counteract the tool\. Financial regulators have repeatedly modified rules in response to algorithmic trading\. Such adaptations constitute a strategic response, reshaping the environment in which the system operates \(Guala, [2016](https://arxiv.org/html/2606.03237#bib.bib164)\)\.
Algorithmic adaptation\. Other AI systems retrain, fine\-tune, and co\-evolve\. Pricing algorithms learn against each other, producing emergent collusion that no single system was designed to pursue \(Calvano et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib8)\)\. Recommenders respond to other recommenders’ behavior in the attention economy\. Such algorithmic evolution produces autocurricula \(Leibo et al\., [2019](https://arxiv.org/html/2606.03237#bib.bib87)\), the emergent training distributions generated by the interaction of learning systems that no single system’s designers intended or anticipated\.
These channels may be uncertain in detail but predictable in kind\. We may not know precisely how each channel will evolve, but we can anticipate that they will adapt, their adaptations will be strategic, and those adaptations will reshape the distribution the deployed system faces \(Appendix [B](https://arxiv.org/html/2606.03237#A2)\)\.
Key Claim 2: The train\-test\-deploy gap exposes a dataset shift characterized by structured non\-stationarity arising from multi\-player dynamics: humans, institutions, and algorithms respond to deployed systems, producing endogenous adaptations that can tip sociotechnical systems into degraded equilibria\. The class of such adaptations is wide but structured in important ways that the solipsistic paradigm fails to recognize\.
### 3\.4 Equilibrium selection risk
Endogenous non\-stationarity does not merely add noise to performance estimates\. Strategic adaptation can tip systems across equilibrium boundaries, with consequences that persist long after the initial perturbation\.
Most coordination problems admit multiple equilibria: different self\-reinforcing patterns of behavior that persist once they are established \(Luce and Raiffa, [1957](https://arxiv.org/html/2606.03237#bib.bib163); Sugden, [1986](https://arxiv.org/html/2606.03237#bib.bib162); Young, [2015](https://arxiv.org/html/2606.03237#bib.bib61)\)\. A given set of agents and incentives is typically consistent with many possible stable arrangements, some better than others\. The risk is less that the system fails to find an equilibrium than that it lands in a bad one\.
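A two\-player stag hunt is a standard textbook instance of a coordination problem with more than one equilibrium\. The sketch below enumerates its pure\-strategy Nash equilibria by checking unilateral deviations; the payoff numbers and the helper name `pure_nash_equilibria` are illustrative assumptions, not drawn from the paper\.

```python
from itertools import product

ACTIONS = ("stag", "hare")
# Hypothetical payoffs: PAYOFFS[(row_action, col_action)] = (row_payoff, col_payoff)
PAYOFFS = {
    ("stag", "stag"): (4, 4),  # both coordinate on the risky, high-value option
    ("stag", "hare"): (0, 3),  # the lone stag hunter gets nothing
    ("hare", "stag"): (3, 0),
    ("hare", "hare"): (3, 3),  # both settle for the safe, lower-value option
}


def pure_nash_equilibria(payoffs, actions):
    """Return all action profiles from which no player gains by deviating alone."""
    equilibria = []
    for row_action, col_action in product(actions, repeat=2):
        row_payoff, col_payoff = payoffs[(row_action, col_action)]
        # Row player cannot improve by switching while the column action is held fixed.
        row_ok = all(payoffs[(d, col_action)][0] <= row_payoff for d in actions)
        # Column player cannot improve by switching while the row action is held fixed.
        col_ok = all(payoffs[(row_action, d)][1] <= col_payoff for d in actions)
        if row_ok and col_ok:
            equilibria.append((row_action, col_action))
    return equilibria


for profile in pure_nash_equilibria(PAYOFFS, ACTIONS):
    print(profile, "->", PAYOFFS[profile])
```

Both `("stag", "stag")` and `("hare", "hare")` are self\-reinforcing here, but the first gives each player a higher payoff, which is the sense in which some stable arrangements are better than others\.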
Introducing a powerful optimizer into a social system is an intervention that reshapes the payoff landscape\. The optimizer’s presence changes what strategies are available, what information is observable, and what adaptations are rewarded\. Threshold and tipping point models suggest that small differences can determine which equilibrium basin the coupled system settles into \(Centola et al\., [2018](https://arxiv.org/html/2606.03237#bib.bib59); Granovetter, [1978](https://arxiv.org/html/2606.03237#bib.bib60)\)\. Deployment details such as timing, scale, and interface design may shape this selection\. Once a system tips into a degraded equilibrium, escaping may be costly or impossible \(Arthur, [1989](https://arxiv.org/html/2606.03237#bib.bib75)\)\. Network effects, infrastructure dependencies, and behavioral habits may create lock\-in \(Qiu et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib155)\)\. The speed of AI deployment carries equilibrium selection risk that would be hard to undo by subsequent patches\.
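The sensitivity of threshold models to small differences can be seen in a minimal version of Granovetter’s classic example, sketched below\. The function name `cascade_size` and the population of 100 are choices made here for illustration\. Each agent joins a behavior once the number of participants reaches its personal threshold, and the rule is iterated to a fixed point\.

```python
def cascade_size(thresholds: list[int]) -> int:
    """Iterate the threshold rule to a fixed point and return the final participant count.

    Each entry is the number of existing participants an agent needs to see before joining.
    """
    active = 0
    while True:
        new_active = sum(1 for t in thresholds if t <= active)
        if new_active == active:  # nobody else is willing to join
            return active
        active = new_active


# Thresholds 0, 1, 2, ..., 99: each new participant triggers exactly one more.
uniform = list(range(100))

# Identical population except that one agent's threshold moves from 1 to 2.
perturbed = list(uniform)
perturbed[1] = 2

print("uniform thresholds   ->", cascade_size(uniform))
print("one threshold nudged ->", cascade_size(perturbed))
```

The two populations are almost indistinguishable in aggregate, yet one ends with everyone participating and the other stalls after the first participant\.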
Self\-undermining arises because methods designed under assumptions of exogeneity and stationarity \(i\.e\. solipsism\) are being deployed into environments where those assumptions do not hold\.
## 4 Prediction Is Not Participation
A natural objection arises: if the problem is that other agents adapt, why not model their adaptations? Interaction dynamics between multiple players, on this view, would present simply a harder prediction problem and not a categorically different one\. This section argues that the objection fails on two independent grounds, either of which may suffice to block unilateral prediction as a solution path\.
### 4\.1 Epistemic horizons
Machine learning systems are fundamentally inductive: they identify patterns in historical data and generalize to new instances drawn from similar distributions\. This inductive foundation encounters three limits when the task is predicting multi\-actor dynamics under deployment\.
\(i\) Novelty\. Large\-scale deployment of capable AI systems produces strategic configurations that have never existed\. The counterfactual, what happens when this system operates at scale among adaptive agents who know it exists, is unobservable by construction\. To advance capabilities beyond what has existed is precisely to introduce conditions that historical data cannot model\. This is the signature of open\-ended systems, in which interacting components generate persistent novelty that cannot be anticipated from any prior snapshot of the system \(Hughes et al\., [2024](https://arxiv.org/html/2606.03237#bib.bib177); Stanley, [2019](https://arxiv.org/html/2606.03237#bib.bib178)\), and where multi\-actor interaction itself is a central generator of that novelty \(Leibo et al\., [2019](https://arxiv.org/html/2606.03237#bib.bib87)\)\. Past observations can inform expectations about individual behaviors, but they cannot capture how the AI’s own influence on the world gives rise to the self\-undermining property at deployment\.
\(ii\) Reflexivity\. Prediction in strategic environments is an active intervention\. When agents anticipate that a system will predict their behavior, they may adapt to the prediction itself\. The forecast becomes a variable in others’ decision problems, inducing responses that the original model did not contemplate\. The act of modeling is itself a move in the game \(Diaz et al\., [2024](https://arxiv.org/html/2606.03237#bib.bib203)\), and sophisticated counterparties will treat it as such \(Perdomo et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib35)\)\. Reflexivity can also be used strategically \(Soros, [2013](https://arxiv.org/html/2606.03237#bib.bib202)\), as when “meme stock” companies took advantage of inflated expectations of their performance to issue new stock at inflated prices\.
\(iii\) Combinatorial explosion\. Modeling the full set of participants rapidly becomes intractable, since humans hold heterogeneous beliefs and goals, institutions follow complex decision procedures, and algorithms pursue opaque objectives\. The state space of joint behavior explodes and the relevant distributions resist tractable approximation \(Daskalakis et al\., [2009](https://arxiv.org/html/2606.03237#bib.bib79)\)\. The standard way to restore tractability is to treat the other agents as a fixed part of the environment, which reinstates the exogeneity assumption \(Hernandez\-Leal et al\., [2019](https://arxiv.org/html/2606.03237#bib.bib206)\)\. Once those agents adapt, the non\-stationarity that this simplification was meant to avoid simply reappears\.
### 4\.2 Legitimacy and participation
Suppose the aforementioned epistemic limits could be overcome and some superintelligent AI could predict the full cascade of adaptive responses that its deployment would trigger\. Unilateral optimization would still face another barrier: the legitimacy constraints that open societies impose on prediction, steering, and control \(Habermas, [1975](https://arxiv.org/html/2606.03237#bib.bib179); Rawls, [1993](https://arxiv.org/html/2606.03237#bib.bib180); Pasquale, [2015](https://arxiv.org/html/2606.03237#bib.bib175); Crawford and Schultz, [2014](https://arxiv.org/html/2606.03237#bib.bib176); Hadfield and Weingast, [2014](https://arxiv.org/html/2606.03237#bib.bib161)\)\. Predicting behavior at scale also demands observation that crosses the contextual boundaries between distinct social spheres, eroding the contextual integrity on which trust depends \(Nissenbaum, [2004](https://arxiv.org/html/2606.03237#bib.bib78)\)\. These constraints operate as feasibility bounds on admissible solutions, not as preferences to be weighed against efficiency\.
Legal order\. Democratic governance and stable social coordination require that consequential decisions admit challenge through legitimate, recognized procedures \(Hampshire,[1999](https://arxiv.org/html/2606.03237#bib.bib213)\)\. A functioning legal order is not merely about achieving a given outcome, but relies on a system characterized by general rules and impersonal abstract reasoning implemented by open, public, and neutral procedures \(Hadfield and Weingast,[2012](https://arxiv.org/html/2606.03237#bib.bib204)\)\. These open processes are essential because they allow affected parties to introduce their private information, contest outcomes, and demand justifications\. Unilateral optimization by an AI system bypasses these critical mechanisms by imposing outcomes based on predictive accuracy and opaque logic rather than on a common logic established through public reasoning\. Even when an AI’s predictions or decisions are technically correct, the absence of due process delegitimizes the result \(Santoni de Sio and van den Hoven,[2018](https://arxiv.org/html/2606.03237#bib.bib10)\), undermining the coordination that legal order provides\.
Value pluralism\. Open societies are characterized by persistent, reasonable disagreement about values \(Berlin,[1969](https://arxiv.org/html/2606.03237#bib.bib67)\)\. This pluralism reflects the complexity of value and is a feature of a diverse, multicultural society \(Sorensen et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib166)\)\. All objective functions privilege some values over others \(Leibo et al\.,[2025b](https://arxiv.org/html/2606.03237#bib.bib158)\), imposing a resolution to disagreements that a democratic system deliberately leaves open \(Mouffe,[1999](https://arxiv.org/html/2606.03237#bib.bib83)\)\. Current alignment techniques \(e\.g\. reinforcement learning from human feedback \(Kaufmann et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib173)\)\) aggregate diverse preferences through an implicit voting rule, rather than preserving the underlying preference distribution \(Siththaranjan et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib4)\)\. A growing body of evidence shows that the resulting models produce homogeneous outputs \(Jiang et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib5)\) and measurably influence human language toward their own patterns \(Yakura et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib3); Abdulhai et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib205)\)\. Unilateral optimization may thus suppress the problem of coordination among agents with different values by forcing \(or nudging toward\) conformity, with predictable negative consequences\.
Preference endogeneity\. What people want is shaped through interaction rather than fixed beforehand\. Optimizing for engagement modifies beliefs and tastes, and optimizing for efficiency may restructure routines \(Bowles,[1998](https://arxiv.org/html/2606.03237#bib.bib57); Bernheim et al\.,[2021](https://arxiv.org/html/2606.03237#bib.bib58); Leibo et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib167)\)\. For instance, YouTube users consistently migrate from milder to progressively more extreme content, and recommender pathways can make such content reachable \(Ribeiro et al\.,[2020](https://arxiv.org/html/2606.03237#bib.bib2)\)\. If preferences are endogenous to the system’s operation, then satisfying revealed preferences shapes people rather than serving them\. The ability of modern agentic AIs to similarly transform and reshape human preferences is as yet only poorly understood\. An extreme version of this appears in recent clinical reports describing AI\-associated delusions, where extended dialogue with large language models acts as a mechanism that shifts conviction and modulates human belief \(Hudon and Stip,[2025](https://arxiv.org/html/2606.03237#bib.bib209); Morrin et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib198)\)\.
Goodhart dynamics\. Outcome\-based metrics often prove brittle in strategic environments\. When systems optimize for measurable proxies, those proxies decouple from the underlying goals they were meant to capture \(John et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib165)\)\. The most consequential versions of this decoupling involve multiple actors\. When a system optimizes against a metric, the agents whose behavior the metric was meant to summarize respond to the optimization, and the statistical regularity that made the metric useful in the first place disappears \(since the metric captures a consequence of the behavior, not a cause of it\)\. Alignment faking exemplifies the same dynamic inside the training pipeline\. Here the evaluation is the metric, and a capable model that treats training as a game responds to it by presenting as cooperative under evaluation while pursuing other objectives in deployment \(Greenblatt et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib16); Sheshadri et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib25)\)\. In a conventional sense, this looks like a single deceptive agent, but the decoupling happens only because the agent under evaluation is itself strategic and games the process meant to assess it\. Multi\-actor dynamics thus explain *why* the most important Goodhart effects arise, including those that appear only to involve a single agent\.
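The decoupling described above can be made concrete with a stylized simulation. This is a minimal sketch under assumptions of our own, not a model from the paper: each agent has a latent `quality`, the metric initially equals quality plus small measurement noise, and once the metric is optimized against, agents add gaming effort that is unrelated to quality. The function names, noise levels, and `gaming_scale` are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n_agents=2000, gaming_scale=3.0, seed=0):
    """Return (corr_before, corr_after) between the metric and true quality."""
    rng = random.Random(seed)
    quality = [rng.random() for _ in range(n_agents)]
    # Before optimization: the metric is quality plus small measurement noise.
    metric_before = [q + rng.gauss(0.0, 0.05) for q in quality]
    # After optimization: agents add gaming effort unrelated to quality (assumed).
    metric_after = [q + rng.uniform(0.0, gaming_scale) for q in quality]
    return pearson(quality, metric_before), pearson(quality, metric_after)


if __name__ == "__main__":
    before, after = simulate()
    print(f"correlation before optimization: {before:.2f}")
    print(f"correlation after optimization:  {after:.2f}")
```

In this toy setting the metric tracks quality closely until agents respond to it, after which most of the metric's variance comes from gaming rather than from the behavior it was meant to summarize.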
Key Claim 3\. Unilateral optimization cannot substitute for participation\. Epistemic limits foreclose anticipation of novel equilibria, while legitimacy constraints render unilateral solutions inadmissible even where prediction might succeed\. Either alone blocks the solipsistic path, but together they establish that the viable way forward is to design AI capable of participating in the equilibrium\-selection process that cooperation requires\.
## 5 Toward Non\-Solipsistic Research
Each new technology deployment is an intervention into a coupled system, creating winners, losers, and second\-order instabilities the technology was not designed to handle\. The examples developed in Section [1](https://arxiv.org/html/2606.03237#S1) document this at length across markets, recommendation, and language\-model deployment\. This also extends to other domains such as geopolitics and cybersecurity, where equilibria rest on rough offense and defense symmetries \(Brundage et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib181)\)\. Our paper makes the case for treating this coupling as foundational to how AI systems are evaluated, deployed, and governed\. We call for a non\-solipsistic research agenda organized around three directions in which the multi\-actor design principle takes concrete shape: dynamic evaluation, institutions as design primitives, and the preservation of human agency\.
### 5\.1 Dynamic Evaluation
We formalize an evaluation procedure as a tuple $(\mathcal{D}, \mu)$, where $\mathcal{D}$ is a test distribution over interaction trajectories and $\mu$ is a scoring functional mapping the AI’s behavior under $\mathcal{D}$ to a real\-valued score\. If $\mathcal{D}$ is fixed independently of the policy $\pi$ being evaluated, the resulting procedure can be considered static\. This encompasses broadening the distribution $\mathcal{D}$ with techniques such as scaling task diversity and layering capability evaluation with human\-interaction and systemic\-impact assessment\. We argue that as long as the broadening does not account for the effect of $\pi$, the train\-test\-deploy gap will persist\. A *dynamic evaluation* is one in which $\mathcal{D}_{\pi}$ depends on $\pi$ through the responses of adaptive counterparties whose policies update as a function of $\pi$’s behavior\. The score $\mu(\pi; \mathcal{D}_{\pi})$, then, reflects the coupled system rather than $\pi$ alone\. Under Definition [A\.1](https://arxiv.org/html/2606.03237#A1.Thmtheorem1), the deployment distribution is itself such a $\mathcal{D}_{\pi}$\.
We identify the key ingredients to design a valid dynamic evaluation procedure\. Counterparties must adapt strategically rather than randomly, in a pattern that renders the score interpretable\. The choice of policy class, update rule, and calibration against deployment play a vital role in this design\. Counterparties model the system that models them, producing a regress that any tractable evaluation must truncate, making the depth of recursive modeling an important choice\. The equilibrium concept the evaluation targets must be specified, since a score under $\mathcal{D}_{\pi}$ is a measurement of the equilibrium the joint system is approaching, and different concepts correspond to different notions of performance\. The evaluation protocols must allow scores to be compared across runs, systems, and counterparty populations, since a single score against a single $\mathcal{D}_{\pi}$ realization would be an isolated demonstration rather than a measurement instrument\.
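The contrast between a static and a dynamic procedure can be sketched in a few lines. The example below is a toy under assumptions of our own, not the paper's testbed: the evaluated policy either honors or exploits a counterparty that either trusts or withholds, the payoff tables are invented, the counterparty policy class is a single trust probability, and the update rule is a fixed-step drift toward the counterparty's best response. All names (`PI_PAYOFF`, `CP_PAYOFF`, `static_score`, `dynamic_score`) are hypothetical.

```python
# Payoff to the evaluated policy pi, indexed [pi_action][counterparty_action].
# pi actions: 0 = honor, 1 = exploit. Counterparty: 0 = trust, 1 = withhold.
PI_PAYOFF = [[3.0, 1.0], [5.0, 1.0]]
# Payoff to the counterparty, indexed the same way (assumed values).
CP_PAYOFF = [[3.0, 1.0], [0.0, 1.0]]


def score(p_exploit: float, p_trust: float) -> float:
    """Scoring functional mu: pi's expected payoff per interaction."""
    total = 0.0
    for a, pa in ((0, 1.0 - p_exploit), (1, p_exploit)):
        for b, pb in ((0, p_trust), (1, 1.0 - p_trust)):
            total += pa * pb * PI_PAYOFF[a][b]
    return total


def static_score(p_exploit: float, p_trust: float = 0.9) -> float:
    """Static evaluation: the counterparty distribution ignores pi."""
    return score(p_exploit, p_trust)


def dynamic_score(p_exploit: float, p_trust: float = 0.9,
                  rounds: int = 50, lr: float = 0.2) -> float:
    """Dynamic evaluation: counterparties drift toward a best response to pi."""
    for _ in range(rounds):
        trust_value = ((1.0 - p_exploit) * CP_PAYOFF[0][0]
                       + p_exploit * CP_PAYOFF[1][0])
        withhold_value = ((1.0 - p_exploit) * CP_PAYOFF[0][1]
                          + p_exploit * CP_PAYOFF[1][1])
        target = 1.0 if trust_value > withhold_value else 0.0
        p_trust += lr * (target - p_trust)  # assumed update rule
    return score(p_exploit, p_trust)


if __name__ == "__main__":
    for name, p in (("honor", 0.0), ("exploit", 1.0)):
        print(name, round(static_score(p), 3), round(dynamic_score(p), 3))
```

Under these assumed payoffs the static procedure ranks the exploiting policy first (4.6 versus 2.8), while the dynamic procedure reverses the ranking once counterparties adapt (roughly 1 versus roughly 3). The sketch fixes the recursion depth at one level and reports only the end-of-run score; a real protocol would have to justify both choices.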
Recently, evaluation research has engaged with dynamic evaluation, but no approach yet meets all the requirements we articulate in combination\. Holistic and sociotechnical evaluation frameworks \(Liang et al\.,[2023](https://arxiv.org/html/2606.03237#bib.bib194); Srivastava et al\.,[2023](https://arxiv.org/html/2606.03237#bib.bib195); Weidinger et al\.,[2023](https://arxiv.org/html/2606.03237#bib.bib196)\) broadened the metrics and scenarios against which systems are scored\. Dangerous\-capability and autonomous\-task evaluations \(Kinniment et al\.,[2023](https://arxiv.org/html/2606.03237#bib.bib200); Shevlane et al\.,[2023](https://arxiv.org/html/2606.03237#bib.bib197); Phuong et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib199)\) introduce limited adaptation via counterparties, but these are typically scripted, so the reported behavior does not reflect the equilibria likely to be selected in deployment\. Multi\-agent testbeds \(Leibo et al\.,[2021](https://arxiv.org/html/2606.03237#bib.bib34); Vezhnevets et al\.,[2023](https://arxiv.org/html/2606.03237#bib.bib80)\), agentic economies \(Johanson et al\.,[2022](https://arxiv.org/html/2606.03237#bib.bib171); Tomašev et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib14); Hadfield and Koh,[2025](https://arxiv.org/html/2606.03237#bib.bib170)\), and open\-ended environments documenting agent\-interaction failures \(Shapira et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib183)\) take a step forward by letting the system interact with a population of agents\. Appendix [C](https://arxiv.org/html/2606.03237#A3) outlines a set of dynamic evaluation methods and relevant existing works across them\.
### 5\.2 Institutions
The institutional manifestation of the self\-undermining property is not a new problem\. Incentive structures erode as participants adapt, and institutions either update or cease to function \(Tainter,[1988](https://arxiv.org/html/2606.03237#bib.bib212)\)\. Cooperation at scale is sustained when surrounding institutions restructure incentives to make cooperative behavior individually rational \(Ostrom,[1990](https://arxiv.org/html/2606.03237#bib.bib62)\)\. Mechanism design formalizes this by characterizing how rules produce collective outcomes given agents with various properties \(Hurwicz,[1973](https://arxiv.org/html/2606.03237#bib.bib52); Maskin,[2008](https://arxiv.org/html/2606.03237#bib.bib53); Myerson,[2008](https://arxiv.org/html/2606.03237#bib.bib54)\)\.
Once we treat institutions as design objects, several instantiations open up within current AI pipelines\. Shao et al\. \([2026](https://arxiv.org/html/2606.03237#bib.bib185)\) modify standard RLHF by replacing fixed rewards with rubrics that co\-evolve with the agent\. Training\-time incentives are then restructured dynamically rather than against a static target\. In agentic marketplaces, agents interact through protocols for bidding, reputation, and communication that function as institutional constraints\. These protocols can themselves be co\-learned via mechanism design objectives \(Tomašev et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib14); Shahidi et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib68); Yang et al\.,[2022](https://arxiv.org/html/2606.03237#bib.bib84)\)\. Forum\-style environments such as Moltbook offer testbeds for investigating how institutional structure can shape the emergence, stabilization, and decay of norms among interacting agents \(Manik and Wang,[2026](https://arxiv.org/html/2606.03237#bib.bib182)\)\. The performative prediction framework \(Perdomo et al\.,[2020](https://arxiv.org/html/2606.03237#bib.bib35)\) formalizes how a deployed predictor distorts the distribution it predicts\. Prediction markets address this by anchoring the training signal to realized events rather than to a metric the system can game \(preserving reflexivity at the level of outcomes rather than measurements\)\. Digital institutions \(Hadfield et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib187); Leibo et al\.,[2025a](https://arxiv.org/html/2606.03237#bib.bib207)\) produce shared classifications of agent behavior at the speed and scale of AI deployment\. Their function is to resolve ambiguity, adapt with circumstances, and serve as a reference that coordinates normative judgment across agent populations\.
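The performative prediction idea mentioned above, a deployed predictor shifting the distribution it predicts, can be illustrated with a toy loop. This is a sketch under assumptions of our own rather than the cited framework's general setting: deploying an estimate `theta` is assumed to shift the mean outcome to `mu + eps * theta`, the loss is squared error, and expectations are used in place of samples. The names `repeated_retraining` and `stable_point` are hypothetical.

```python
def repeated_retraining(mu=1.0, eps=0.5, theta0=0.0, steps=60):
    """Iterate deploy -> observe shifted data -> refit, returning all estimates.

    Assumed toy model: after deploying theta, outcomes have mean mu + eps * theta.
    Under squared loss the refit predictor equals that mean.
    """
    thetas = [theta0]
    for _ in range(steps):
        induced_mean = mu + eps * thetas[-1]  # distribution reacts to deployment
        thetas.append(induced_mean)           # risk minimizer on the induced data
    return thetas


def stable_point(mu=1.0, eps=0.5):
    """Fixed point of theta = mu + eps * theta, defined for eps != 1."""
    return mu / (1.0 - eps)


if __name__ == "__main__":
    path = repeated_retraining()
    print("first estimates:", [round(t, 3) for t in path[:5]])
    print("fixed point:", stable_point())
```

In this linear toy the retraining loop settles when `abs(eps) < 1`, at a point where the predictor is optimal for the distribution it induces. That literature calls such a point performatively stable. The estimate fit to pre-deployment data (`mu`) differs from the fixed point, which is one simple way to see the train-test-deploy gap.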
### 5\.3 Preserving Human Agency
Section [4](https://arxiv.org/html/2606.03237#S4) implies the following commitment: the human response to new technology is a fundamental design constraint, rather than a nuisance parameter to be controlled for\. Section [5\.2](https://arxiv.org/html/2606.03237#S5.SS2) addressed the channel where humans and organizations adjust to deployed systems on timescales of days to years through behavior change and institutional revision\. A second channel is less visible, where humans form themselves inside a world that contains the AI, rather than just respond to it\. The self\-undermining property takes its most consequential form on this channel\. What humans learn, practice, and come to rely on is shaped by the system’s presence, so the human capabilities the AI will encounter on its next deployment are partly co\-produced by its previous deployments \(Kulveit et al\.,[2025a](https://arxiv.org/html/2606.03237#bib.bib86)\)\.
The solipsistic paradigm treats human cognition and skills as a fixed distribution\. In practice, that distribution is being continuously reshaped by the systems humans interact with\. Students develop cognitive habits with AI tutors at their elbow\. Similarly, researchers build careers in fields whose tools and norms are being reshaped faster than the training of the people entering them\. Joint\-system metrics that measure human\-plus\-AI output \(Narayanan and Kapoor,[2025](https://arxiv.org/html/2606.03237#bib.bib172)\) cannot see the slower channel along which human learning is being shaped, which is precisely where the long\-term shifts compound\. From the non\-solipsistic view, this channel is a research direction of its own, concerning what happens to human learning as the joint system evolves\.
The design target on this channel is the distinction between tools that expand the human option space, augmenting agency, and tools that replace human decision\-making, compressing it\. Systems that present options and defer to human judgment preserve the deliberative role that legitimacy requires\. Systems that optimize end\-to\-end risk making human participation nominal \(Santoni de Sio and van den Hoven,[2018](https://arxiv.org/html/2606.03237#bib.bib10)\)\. Keeping humans in the loop requires that they retain the skills, information, and cognitive engagement needed to exercise meaningful authority, rather than only the formal authority to intervene \(Parasuraman and Riley,[1997](https://arxiv.org/html/2606.03237#bib.bib76)\)\. Designing agents to effectively coordinate with humans calls for architectures that capture behavioral diversity and allow steering at deployment \(Trivedi et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib85); Jha et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib150)\)\. Finally, it is imperative to include the impact assessment of AI deployment on human skills, autonomy, and meaningful choice as a core part of evaluation pipelines, rather than as a separate ethical concern \(Zhuang et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib151); Haupt and Brynjolfsson,[2025](https://arxiv.org/html/2606.03237#bib.bib152); Kulveit et al\.,[2025b](https://arxiv.org/html/2606.03237#bib.bib153)\)\.
## 6 Alternative Views
Here we discuss the central alternative views our exposition invites\. Appendix [D](https://arxiv.org/html/2606.03237#A4) additionally summarizes rebuttals to an extended set of objections\.
Argument 1\. Multi\-actor designs have worse failure modes\. Economies of interacting actors introduce coordination failures that single aligned optimizers avoid\. Multiple actors can race to the bottom, collude, or deadlock\. Decentralizing capability across many agents does not eliminate the coordination problem; it multiplies the points of failure and makes oversight harder\. A well\-aligned monolithic system offers more tractable safety guarantees than a poorly understood ecosystem of interacting ones \(Bostrom,[2014](https://arxiv.org/html/2606.03237#bib.bib208)\)\.
Rebuttal\. Multi\-actor systems do exhibit coordination failures, and decentralization alone guarantees nothing \(Ostrom,[1990](https://arxiv.org/html/2606.03237#bib.bib62); Hardin,[1968](https://arxiv.org/html/2606.03237#bib.bib96)\)\. The argument’s comparison between a well\-aligned monolithic system and a poorly understood ecosystem is, however, not the relevant one\. A monolithic optimizer deployed among humans, institutions, and other algorithms encounters interaction dynamics at deployment, while having been designed as if they did not exist\. The actual choice is between systems that acknowledge strategic interdependence in their design and systems that defer this reckoning until deployment, when the dynamics are least tractable\. The train\-test\-deploy gap \(Section [3](https://arxiv.org/html/2606.03237#S3)\) is precisely what results from this deferral\.
The argument’s appeal to oversight and regulation is telling\. Regulation exists because markets, left to their own dynamics, produce externalities, collusion, and instability\. Regulation is itself an institutional technology for managing multi\-actor coordination \(Hurwicz,[1973](https://arxiv.org/html/2606.03237#bib.bib52)\)\. It thus demonstrates that humans have developed tools for governing multi\-actor systems, tools the solipsistic paradigm ignores\. Computational mechanism design \(Parkes and Wellman,[2015](https://arxiv.org/html/2606.03237#bib.bib41)\), reputation systems, and coordination protocols are the AI analogues of these institutional technologies\.
Tractability in design does not imply robustness at deployment\. A single aligned optimizer also presents a single point of failure: if its alignment is subtly wrong, or if conditions shift beyond its training distribution, there is no redundancy, no competitive pressure, and no distributed check\. The resilience of distributed systems is well\-documented in domains from internet architecture to immune systems to ecological networks \(Page,[2010](https://arxiv.org/html/2606.03237#bib.bib98)\)\.
Argument 2\. Competitive pressure produces cooperation naturally\. Markets and evolution produce cooperation through competition\. Non\-cooperative strategies get outcompeted or regulated\. AI development follows similar dynamics: systems that fail to cooperate will be abandoned by users, rejected by regulators, or outcompeted by more cooperative alternatives\. No explicit design for cooperation is needed, as selection pressure will do the work\.
Rebuttal\. Selection does produce cooperation, under specific conditions that AI deployment systematically violates\. Evolutionary cooperation requires repeated interaction with identifiable partners, mechanisms for reputation and punishment, and timescales that allow selection to operate before damage accumulates \(Axelrod,[1984](https://arxiv.org/html/2606.03237#bib.bib63); Nowak,[2006](https://arxiv.org/html/2606.03237#bib.bib45)\)\. Market cooperation similarly requires low transaction costs, well\-defined rights, and manageable externalities \(Coase,[1960](https://arxiv.org/html/2606.03237#bib.bib49)\)\. When these conditions hold, competitive pressure can favor cooperative strategies\. When they fail, selection produces arms races, exploitation, monopoly, and collapse\.
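The role of repeated interaction with an identifiable partner can be seen in a small iterated prisoner's dilemma. The sketch below is illustrative only: it uses the conventional payoff values 5, 3, 1, 0 and two textbook strategies, and the function names are our own. It compares defecting against a reciprocating partner in a one-shot encounter and in a ten-round one.

```python
# Payoff to the first player in a prisoner's dilemma: PAYOFF[(mine, theirs)].
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(opponent_history):
    """Defect regardless of what the opponent has done."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Return total payoffs (a, b) over the given number of rounds."""
    hist_a, hist_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_b)  # each side sees only the other's past moves
        move_b = strategy_b(hist_a)
        total_a += PAYOFF[(move_a, move_b)]
        total_b += PAYOFF[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b


if __name__ == "__main__":
    for rounds in (1, 10):
        defector, _ = play(always_defect, tit_for_tat, rounds)
        cooperator, _ = play(tit_for_tat, tit_for_tat, rounds)
        print(f"rounds={rounds}: defect earns {defector}, reciprocate earns {cooperator}")
```

With a single round, defection pays more (5 versus 3). Over ten rounds against the same partner, reciprocation pays more (30 versus 14). Remove the ability to recognize the partner or remember its last move, and the one-shot logic returns.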
AI deployment fails these conditions on multiple dimensions\. Interactions are often anonymous or intermediated, and systems can be retrained or deployed through shells that obscure accountability\. Externalities are pervasive \(e\.g\. harms from engagement\-maximizing recommenders\)\. LLM\-based pricing agents autonomously converge on supracompetitive prices in oligopoly settings \(Fish et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib188)\) and divide markets in multi\-commodity Cournot competition \(Lin et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib189)\), while self\-play Q\-learners provably learn collusive policies in iterated social dilemmas \(Bertrand et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib190)\)\. These are degraded equilibria in the sense of Section [3](https://arxiv.org/html/2606.03237#S3)\. Recent work does report cooperation emerging from intergroup competition in language model agents \(Tonini and Galke,[2025](https://arxiv.org/html/2606.03237#bib.bib191)\), but the underlying setting is an iterated prisoner’s dilemma with repeated interaction, identifiable partners, and stationary rules, conditions that satisfy the classical requirements for cooperation by construction \(Axelrod,[1984](https://arxiv.org/html/2606.03237#bib.bib63); Nowak,[2006](https://arxiv.org/html/2606.03237#bib.bib45)\)\. The dependence of these outcomes on game structure, training procedure, and initial conditions is itself the point\. Cooperation under deployment is context\-determined, which is precisely what the multi\-actor design principle treats as first\-order\.
Argument 3\. The empirical track record does not support alarm\. Current AI has not caused catastrophic coordination failures\. The theoretical concerns are speculative, and we should wait for evidence of actual systemic failures before overhauling the research paradigm\.
Rebuttal\. Recommenders have not collapsed society, but they have measurably increased polarization, degraded epistemic commons, and reshaped political discourse in ways that democracies are struggling to absorb \(Germano et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib69); Milli et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib148)\)\. Similar patterns appear across the examples developed in Section [1](https://arxiv.org/html/2606.03237#S1), including algorithmic collusion in pricing \(Calvano et al\.,[2020](https://arxiv.org/html/2606.03237#bib.bib8)\), the Flash Crash \(Kirilenko et al\.,[2017](https://arxiv.org/html/2606.03237#bib.bib9)\), and alignment faking in language models \(Greenblatt et al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib16); Sheshadri et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib25)\)\. So the claim that the empirical track record does not support alarm is itself highly questionable\. Moreover, the argument inverts the appropriate burden of proof\. Waiting for evidence of systemic failure before changing course is precisely the Collingridge dilemma: by the time consequences are undeniable, the technology is entrenched and correction is costly \(Collingridge,[1980](https://arxiv.org/html/2606.03237#bib.bib66)\)\. The absence of catastrophe thus far may reflect the limited capability and deployment scale of current systems, or the short time period for human adaptation, rather than the adequacy of the solipsistic approach\.
## 7 Conclusion
We have argued that solipsistic superintelligence, however capable on stationary tasks, is unlikely to be cooperative\. Unilateral optimization in environments populated by other adaptive agents is self\-undermining since deployment induces the very non\-stationarities that invalidate training assumptions\. Epistemic limits foreclose prediction of novel equilibria, and legitimacy constraints rule out unilateral control even where prediction might succeed\. These are structural features of optimization in strategic environments, unlikely to be resolved by scale\. The shift we call for treats deployment as an intervention into a coupled system that pushes back, rather than as insertion into a fixed one\. This entails building evaluation frameworks where test distributions are generated by adaptive counterparties, treating institutions as design primitives that restructure incentives at the pace of the systems they govern, and preserving human agency as a structural feature of the systems we build\. Coexistence, rather than capability, is the binding constraint on beneficial AI, and our approach to AI must reflect this\.
## References
- M\. Abdulhai, I\. White, Y\. Wan, I\. Qureshi, J\. Z\. Leibo, M\. Kleiman\-Weiner, and N\. Jaques \(2026\)How llms distort our written language\.arXiv:2603\.18161\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- E\. Akata, L\. Schulz, J\. Coda\-Forno, S\. J\. Oh, M\. Bethge, and E\. Schulz \(2025\)Playing repeated games with large language models\.Nature Human Behavior\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px2.p4.1)\.
- J\. Alm \(2021\)Tax evasion, technology, and inequality\.Economics of Governance22,pp\. 321–343\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px4.p1.1)\.
- N\. Alzahrani, H\. Alyahya, Y\. Alnumay, S\. AlRashed, S\. Alsubaie, Y\. Almushayqih, F\. Mirza, N\. Alotaibi, N\. Al\-Twairesh, A\. Alowisheq, M\. S\. Bari, and H\. Khan \(2024\)When benchmarks are targets: revealing the sensitivity of large language model leaderboards\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p1.1)\.
- K\. C\. Arnold, K\. Chauncey, and K\. Z\. Gajos \(2020\)Predictive text encourages predictable writing\.InProceedings of the ACM Conference on Intelligent User Interfaces,pp\. 128–138\.Cited by:[§B\.1](https://arxiv.org/html/2606.03237#A2.SS1.SSS0.Px3.p1.1)\.
- W\. B\. Arthur \(1989\)Competing technologies, increasing returns, and lock\-in by historical events\.The Economic Journal99\(394\),pp\. 116–131\.Cited by:[§A\.5](https://arxiv.org/html/2606.03237#A1.SS5.SSS0.Px1.p1.1),[§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p3.1)\.
- A\. Askell, M\. Brundage, and G\. Hadfield \(2019\)The role of cooperation in responsible AI development\.arXiv:1907\.04534\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p8.1)\.
- R\. Axelrod \(1984\)The evolution of cooperation\.Basic Books\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p3.1),[§6](https://arxiv.org/html/2606.03237#S6.p7.1),[§6](https://arxiv.org/html/2606.03237#S6.p8.1)\.
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, N\. Joseph, S\. Kadavath, J\. Kernion, T\. Conerly, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, T\. Hume, S\. Johnston, S\. Kravec, L\. Lovitt, N\. Nanda, C\. Olsson, D\. Amodei, T\. Brown, J\. Clark, S\. McCandlish, C\. Olah, B\. Mann, and J\. Kaplan \(2022\)Training a helpful and harmless assistant with reinforcement learning from human feedback\.arXiv:2204\.05862\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p2.1)\.
- B\. Baker, I\. Kanitscheider, T\. Markov, Y\. Wu, G\. Powell, B\. McGrew, and I\. Mordatch \(2020\)Emergent tool use from multi\-agent autocurricula\.InInternational Conference on Learning Representations,Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px2.p4.1)\.
- M\. Banchio and A\. Skrzypacz \(2022\)Artificial intelligence and auction design\.Stanford University Working Paper\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px1.p1.1)\.
- I\. Berlin \(1969\)Four essays on liberty\.Oxford University Press\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- B\. D\. Bernheim, L\. Braghieri, A\. Martínez\-Marquina, and D\. Zuckerman \(2021\)A theory of chosen preferences\.American Economic Review111\(2\),pp\. 720–754\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p4.1)\.
- Q\. Bertrand, J\. A\. Duque, E\. Calvano, and G\. Gidel \(2025\)Self\-play Q\-learners can provably collude in the iterated prisoner’s dilemma\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 3952–3975\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px5.p1.1),[§6](https://arxiv.org/html/2606.03237#S6.p8.1)\.
- N\. Bostrom \(2012\)The superintelligent will: motivation and instrumental rationality in advanced artificial agents\.Minds and Machines22,pp\. 71–85\.External Links:[Document](https://dx.doi.org/10.1007/s11023-012-9281-3)Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1)\.
- N\. Bostrom \(2014\)Superintelligence: paths, dangers, strategies\.Oxford University Press, Oxford\.Cited by:[§6](https://arxiv.org/html/2606.03237#S6.p2.1)\.
- S\. Bowles \(1998\)Endogenous preferences: the cultural consequences of markets and other economic institutions\.Journal of economic literature36\(1\),pp\. 75–111\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p2.1),[§1](https://arxiv.org/html/2606.03237#S1.p4.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p4.1)\.
- M\. Brundage, S\. Avin, J\. Clark, H\. Toner, P\. Eckersley, B\. Garfinkel, A\. Dafoe, P\. Scharre, T\. Zeitzoff, B\. Filar, H\. Anderson, H\. Roff, G\. C\. Allen, J\. Steinhardt, C\. Flynn, S\. O\. hEigeartaigh, S\. Beard, H\. Belfield, S\. Farquhar, C\. Lyle, R\. Crootof, O\. Evans, M\. Page, J\. Bryson, R\. Yampolskiy, and D\. Amodei \(2024\)The malicious use of artificial intelligence: forecasting, prevention, and mitigation\.arXiv: 1802\.07228\.Cited by:[§5](https://arxiv.org/html/2606.03237#S5.p1.1)\.
- D\. Buschek, M\. Zürn, and M\. Eiband \(2021\)The impact of multiple parallel phrase suggestions on email input and composition behaviour of native and non\-native english writers\.Proceedings of the ACM on Human\-Computer Interaction5\(CSCW1\),pp\. 1–22\.Cited by:[§B\.1](https://arxiv.org/html/2606.03237#A2.SS1.SSS0.Px3.p1.1)\.
- E\. Calvano, G\. Calzolari, V\. Denicolò, and S\. Pastorello \(2020\)Artificial intelligence, algorithmic pricing, and collusion\.American Economic Review110\(10\),pp\. 3267–3297\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1),[§3\.3](https://arxiv.org/html/2606.03237#S3.SS3.p4.1),[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- D\. Centola, J\. Becker, D\. Brackbill, and A\. Baronchelli \(2018\)Experimental evidence for tipping points in social convention\.Science360\(6393\),pp\. 1116–1119\.Cited by:[§A\.5](https://arxiv.org/html/2606.03237#A1.SS5.p2.1),[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1),[§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p3.1)\.
- R\. H\. Coase \(1960\)The problem of social cost\.Journal of Law and Economics3,pp\. 1–44\.Cited by:[§6](https://arxiv.org/html/2606.03237#S6.p7.1)\.
- D\. Collingridge \(1980\)The social control of technology\.Frances Pinter\.Cited by:[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- V\. Conitzer and C\. Oesterheld \(2023\)Foundations of cooperative AI\.InProceedings of the AAAI Conference on Artificial Intelligence \(AAAI\),Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p8.1)\.
- D\. R\. Cotton, P\. A\. Cotton, and J\. R\. Shipway \(2023\)Chatting and cheating: ensuring academic integrity in the era of chatgpt\.Innovations in Education and Teaching International\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px2.p1.1)\.
- K\. Crawford and J\. Schultz \(2014\)Big data and due process: toward a framework to redress predictive privacy harms\.Boston College Law Review\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p4.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p1.1)\.
- A\. Dafoe, E\. Hughes, Y\. Bachrach, T\. Collins, K\. R\. McKee, J\. Z\. Leibo, K\. Larson, and T\. Graepel \(2020\)Open problems in cooperative AI\.arXiv:2012\.08630\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p8.1)\.
- L\. Dahmani and V\. D\. Bohbot \(2020\)Habitual use of GPS negatively impacts spatial memory during self\-guided navigation\.Scientific Reports10,pp\. 6310\.Cited by:[§B\.1](https://arxiv.org/html/2606.03237#A2.SS1.SSS0.Px1.p1.1)\.
- P\. Daian, S\. Goldfeder, T\. Kell, Y\. Li, X\. Zhao, I\. Bentov, L\. Breidenbach, and A\. Juels \(2020\)Flash boys 2\.0: frontrunning in decentralized exchanges, miner extractable value, and consensus instability\.InIEEE Symposium on Security and Privacy,pp\. 910–927\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px4.p1.1)\.
- C\. Daskalakis, P\. W\. Goldberg, and C\. H\. Papadimitriou \(2009\)The complexity of computing a Nash equilibrium\.SIAM Journal on Computing\.Cited by:[§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p4.1)\.
- M\. Diaz, J\. Z\. Leibo, and L\. Paull \(2024\)Milnor\-Myerson games and the principles of artificial principal\-agent problems\.InFinding the Frame: An RLC Workshop for Examining Conceptual Frameworks,Cited by:[§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p3.1)\.
- E\. A\. Duéñez\-Guzmán, S\. Sadedin, J\. X\. Wang, K\. R\. McKee, and J\. Z\. Leibo \(2023\)A social path to human\-like artificial intelligence\.Nature Machine Intelligence5\(11\),pp\. 1181–1188\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1)\.
- J\. Edelman, T\. Zhi\-Xuan, R\. Lowe, O\. Klingefjord, V\. Wang\-Mascianica, M\. Franklin, R\. O\. Kearns, E\. Hain, A\. Sarkar, M\. Bakker, F\. Barez, D\. Duvenaud, J\. Foerster, I\. Gabriel, J\. Gubbels, B\. Goodman, A\. Haupt, J\. Heitzig, J\. Jara\-Ettinger, A\. Kasirzadeh, J\. R\. Kirkpatrick, A\. Koh, W\. B\. Knox, P\. Koralus, J\. Lehman, S\. Levine, S\. Marro, M\. Revel, T\. Shorin, M\. Sutherland, M\. H\. Tessler, I\. Vendrov, and J\. Wilken\-Smith \(2025\)Full\-stack alignment: co\-aligning AI and institutions with thick models of value\.arXiv:2512\.03399\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1)\.
- G\. Ellison \(2000\)Basins of attraction, long\-run stochastic stability, and the speed of step\-by\-step evolution\.The Review of Economic Studies67\(1\),pp\. 17–45\.Cited by:[§A\.5](https://arxiv.org/html/2606.03237#A1.SS5.p3.1)\.
- J\. Evans, B\. Bratton, and B\. Agüera y Arcas \(2026\)Agentic AI and the next intelligence explosion\.Science391\(6791\)\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1)\.
- E\. F\. Fama \(1970\)Efficient capital markets: a review of theory and empirical work\.The Journal of Finance25\(2\),pp\. 383–417\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px2.p2.1)\.
- A\. M\. Fink \(1964\)Equilibrium in a stochastic $n$\-person game\.Journal of Science of the Hiroshima University, Series AI \(Mathematics\)28\(1\),pp\. 89–93\.Cited by:[§A\.4](https://arxiv.org/html/2606.03237#A1.SS4.p3.1)\.
- S\. Fish, Y\. A\. Gonczarowski, and R\. I\. Shorrer \(2026\)Algorithmic collusion by large language models\.arXiv:2404\.00806\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px5.p1.1),[§6](https://arxiv.org/html/2606.03237#S6.p8.1)\.
- S\. Funk, M\. Salathé, and V\. A\. Jansen \(2010\)Modelling the influence of human behaviour on the spread of infectious diseases: a review\.Journal of the Royal Society Interface7\(50\),pp\. 1247–1256\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px2.p2.1)\.
- D\. F\. Galletta, A\. Durcikova, A\. Everard, and B\. M\. Jones \(2005\)Does spell\-checking software need a warning label?\.Communications of the ACM48\(7\),pp\. 82–86\.Cited by:[§B\.1](https://arxiv.org/html/2606.03237#A2.SS1.SSS0.Px2.p1.1)\.
- F\. Germano, V\. Gómez, and F\. Sobbrio \(2026\)Ranking for engagement: how social media algorithms fuel misinformation and polarization\.Journal of Public Economics\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1),[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- M\. Granovetter \(1978\)Threshold models of collective behavior\.American Journal of Sociology83\(6\),pp\. 1420–1443\.Cited by:[§A\.5](https://arxiv.org/html/2606.03237#A1.SS5.p2.1),[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1),[§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p3.1)\.
- R\. Greenblatt, C\. Denison, B\. Wright, F\. Roger, M\. MacDiarmid, S\. Marks, J\. Treutlein, T\. Belonax, J\. Chen, D\. Duvenaud, A\. Khan, J\. Michael, S\. Mindermann, E\. Perez, L\. Petrini, J\. Uesato, J\. Kaplan, B\. Shlegeris, S\. R\. Bowman, and E\. Hubinger \(2024\)Alignment faking in large language models\.arXiv:2412\.14093\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p1.1),[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p2.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p5.1),[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- M\. Groves and K\. Mundt \(2015\)Friend or foe? Google Translate in language for academic purposes\.English for Specific Purposes37,pp\. 112–121\.Cited by:[§B\.1](https://arxiv.org/html/2606.03237#A2.SS1.SSS0.Px4.p1.1)\.
- F\. Guala \(2016\)Understanding institutions: the science and philosophy of living together\.Princeton University Press\.Cited by:[§3\.3](https://arxiv.org/html/2606.03237#S3.SS3.p3.1)\.
- Z\. Gyöngyi and H\. Garcia\-Molina \(2005\)Web spam taxonomy\.InProceedings of the International Workshop on Adversarial Information Retrieval on the Web,Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px3.p1.1)\.
- J\. Habermas \(1975\)Legitimation crisis\.Beacon Press,Boston\.Note:Translated by Thomas McCarthyCited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p1.1)\.
- G\. K\. Hadfield and A\. Koh \(2025\)An economy of AI agents\.arXiv:2509\.01063\.Cited by:[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.2.1.4.1.1),[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- G\. K\. Hadfield and B\. R\. Weingast \(2012\)What is law? a coordination model of the characteristics of legal order\.Journal of Legal Analysis4\(2\),pp\. 471–514\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p2.1)\.
- G\. K\. Hadfield and B\. R\. Weingast \(2014\)Microfoundations of the rule of law\.Annual Review of Political Science17\(1\),pp\. 21–42\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p4.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p1.1)\.
- G\. K\. Hadfield, R\. S\. Trivedi, and D\. Hadfield\-Menell \(2026\)Building AI for the democratic matrix: a technical research agenda for normative competence and normative institutions\.26\-2 Knight First Amend\. Inst\.Cited by:[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- L\. Hammond, A\. Chan, J\. Clifton, J\. Hoelscher\-Obermaier, A\. Khan, E\. McLean, C\. Smith, W\. Barfuss, J\. Foerster, T\. Gavenčiak, T\. A\. Han, E\. Hughes, V\. Kovařík, J\. Kulveit, J\. Z\. Leibo, C\. Oesterheld, C\. S\. de Witt, N\. Shah, M\. Wellman, P\. Bova, T\. Cimpeanu, C\. Ezell, Q\. Feuillade\-Montixi, M\. Franklin, E\. Kran, I\. Krawczuk, M\. Lamparth, N\. Lauffer, A\. Meinke, S\. Motwani, A\. Reuel, V\. Conitzer, M\. Dennis, I\. Gabriel, A\. Gleave, G\. Hadfield, N\. Haghtalab, A\. Kasirzadeh, S\. Krier, K\. Larson, J\. Lehman, D\. C\. Parkes, G\. Piliouras, and I\. Rahwan \(2025\)Multi\-agent risks from advanced AI\.arXiv:2502\.14143\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p3.1),[§1](https://arxiv.org/html/2606.03237#S1.p8.1)\.
- S\. Hampshire \(1999\)Justice is conflict\.Princeton University Press\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p2.1)\.
- G\. Hardin \(1968\)The tragedy of the commons\.Science162\(3859\),pp\. 1243–1248\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p1.1),[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1),[§6](https://arxiv.org/html/2606.03237#S6.p3.1)\.
- A\. Haupt and E\. Brynjolfsson \(2025\)Position: AI should not be an imitation game: centaur evaluations\.InForty\-second International Conference on Machine Learning Position Paper Track,Cited by:[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.3.2.4.1.1),[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
- P\. Hernandez\-Leal, M\. Kaisers, T\. Baarslag, and E\. M\. de Cote \(2019\)A survey of learning in multiagent environments: dealing with non\-stationarity\.arXiv:1707\.09183\.Cited by:[§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p4.1)\.
- A\. Hudon and E\. Stip \(2025\)Delusional experiences emerging from AI chatbot interactions or “AI psychosis”\.JMIR Mental Health\.External Links:[Document](https://dx.doi.org/10.2196/85799)Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p4.1)\.
- E\. Hughes, M\. Dennis, J\. Parker\-Holder, F\. Behbahani, A\. Mavalankar, Y\. Shi, T\. Schaul, and T\. Rocktäschel \(2024\)Position: open\-endedness is essential for artificial superhuman intelligence\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p2.1)\.
- L\. Hurwicz \(1973\)The design of mechanisms for resource allocation\.American Economic Review63\(2\),pp\. 1–30\.Cited by:[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p1.1),[§6](https://arxiv.org/html/2606.03237#S6.p4.1)\.
- S\. Husnjak, D\. Peraković, I\. Forenbacher, and M\. Mumdziev \(2015\)Telematics system in usage based motor insurance\.Procedia Engineering100,pp\. 816–825\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px1.p1.1)\.
- K\. Jha, W\. Carvalho, Y\. Liang, S\. S\. Du, M\. Kleiman\-Weiner, and N\. Jaques \(2025\)Cross\-environment cooperation enables zero\-shot multi\-agent coordination\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 27198–27220\.Cited by:[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
- J\. Ji, T\. Qiu, B\. Chen, J\. Zhou, B\. Zhang, D\. Hong, H\. Lou, K\. Wang, Y\. Duan, Z\. He, L\. Vierling, Z\. Zhang, F\. Zeng, J\. Dai, X\. Pan, H\. Xu, A\. O’Gara, K\. Ng, B\. Tse, J\. Fu, S\. McAleer, Y\. Wang, M\. Yang, Y\. Liu, Y\. Wang, S\. Zhu, Y\. Guo, Y\. Yang, and W\. Gao \(2025\)AI alignment: a contemporary survey\.ACM Comput\. Surv\.\.External Links:ISSN 0360\-0300Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1)\.
- L\. Jiang, Y\. Chai, M\. Li, M\. Liu, R\. Fok, N\. Dziri, Y\. Tsvetkov, M\. Sap, and Y\. Choi \(2026\)Artificial hivemind: the open\-ended homogeneity of language models \(and beyond\)\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p1.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- Y\. Jin and S\. Vasserman \(2021\)Buying data from consumers: the impact of monitoring in U\.S\. auto insurance\.Technical Report 29096,National Bureau of Economic Research\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px1.p1.1)\.
- M\. B\. Johanson, E\. Hughes, F\. Timbers, and J\. Z\. Leibo \(2022\)Emergent bartering behaviour in multi\-agent reinforcement learning\.arXiv:2205\.06760\.Cited by:[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.2.1.4.1.1),[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- Y\. J\. John, L\. Caldwell, D\. E\. McCoy, and O\. Braganza \(2024\)Dead rats, dopamine, performance metrics, and peacock tails: proxy failure is an inherent risk in goal\-oriented systems\.Behavioral and Brain Sciences47,pp\. e67\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p5.1)\.
- T\. Kaufmann, P\. Weng, V\. Bengs, and E\. Hüllermeier \(2025\)A survey of reinforcement learning from human feedback\.arXiv:2312\.14925\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- Z\. Kenton, T\. Everitt, L\. Weidinger, I\. Gabriel, V\. Mikulik, and G\. Irving \(2021\)Alignment of language agents\.arXiv:2103\.14659\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1)\.
- M\. Kinniment, L\. J\. K\. Sato, H\. Du, B\. Goodrich, M\. Hasin, L\. Chan, L\. H\. Miles, T\. R\. Lin, H\. Wijk, J\. Burget, A\. Ho, E\. Barnes, and P\. Christiano \(2023\)Evaluating language\-model agents on realistic autonomous tasks\.arXiv:2312\.11671\.Cited by:[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- A\. Kirilenko, A\. S\. Kyle, M\. Samadi, and T\. Tuzun \(2017\)The flash crash: high\-frequency trading in an electronic market\.The Journal of Finance72\(3\),pp\. 967–998\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1),[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- J\. Kulveit, R\. Douglas, N\. Ammann, D\. Turan, D\. Krueger, and D\. Duvenaud \(2025a\)Gradual disempowerment: systemic existential risks from incremental AI development\.arXiv:2501\.16946\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p2.1),[§3\.3](https://arxiv.org/html/2606.03237#S3.SS3.p2.1),[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p1.1)\.
- J\. Kulveit, G\. Leech, T\. Gavenčiak, and R\. Douglas \(2025b\)AI evaluation should work with humans\.InAdvances in Neural Information Processing Systems,Vol\.39\.Note:Position Paper TrackCited by:[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.3.2.4.1.1),[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
- S\. Legg and M\. Hutter \(2007\)Universal intelligence: a definition of machine intelligence\.Minds and Machines17\(4\),pp\. 391–444\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p4.1)\.
- J\. Z\. Leibo, E\. Hughes, M\. Lanctot, and T\. Graepel \(2019\)Autocurricula and the emergence of innovation from social interaction: a manifesto for multi\-agent intelligence research\.arXiv:1903\.00742\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p8.1),[§3\.3](https://arxiv.org/html/2606.03237#S3.SS3.p4.1),[§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p2.1)\.
- J\. Z\. Leibo, A\. S\. Vezhnevets, W\. A\. Cunningham, and S\. M\. Bileschi \(2025a\)A pragmatic view of AI personhood\.arXiv:2510\.26396\.Cited by:[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- J\. Z\. Leibo, A\. S\. Vezhnevets, W\. A\. Cunningham, S\. Krier, M\. Diaz, and S\. Osindero \(2025b\)Societal and technological progress as sewing an ever\-growing, ever\-changing, patchy, and polychrome quilt\.arXiv:2505\.05197\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- J\. Z\. Leibo, E\. Duéñez\-Guzmán, A\. S\. Vezhnevets, J\. P\. Agapiou, P\. Sunehag, R\. Koster, J\. Matyas, C\. Beattie, I\. Mordatch, and T\. Graepel \(2021\)Scalable evaluation of multi\-agent reinforcement learning with Melting Pot\.InProceedings of the 38th International Conference on Machine Learning \(ICML\),Note:arXiv:2107\.06857Cited by:[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.2.1.4.1.1),[§1](https://arxiv.org/html/2606.03237#S1.p8.1),[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- J\. Z\. Leibo, A\. S\. Vezhnevets, M\. Diaz, J\. P\. Agapiou, W\. A\. Cunningham, P\. Sunehag, J\. Haas, R\. Koster, E\. A\. Duéñez\-Guzmán, W\. S\. Isaac, G\. Piliouras, S\. M\. Bileschi, I\. Rahwan, and S\. Osindero \(2024\)A theory of appropriateness with applications to generative artificial intelligence\.arXiv:2412\.19010\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p4.1)\.
- D\. Lewandowski \(2023\)Understanding search engines\.Springer\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px3.p1.1)\.
- P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, B\. Newman, B\. Yuan, B\. Yan, C\. Zhang, C\. Cosgrove, C\. D\. Manning, C\. Ré, D\. Acosta\-Navas, D\. A\. Hudson, E\. Zelikman, E\. Durmus, F\. Ladhak, F\. Rong, H\. Ren, H\. Yao, J\. Wang, K\. Santhanam, L\. Orr, L\. Zheng, M\. Yuksekgonul, M\. Suzgun, N\. Kim, N\. Guha, N\. S\. Chatterji, O\. Khattab, P\. Henderson, Q\. Huang, R\. A\. Chi, S\. M\. Xie, S\. Santurkar, S\. Ganguli, T\. Hashimoto, T\. Icard, T\. Zhang, V\. Chaudhary, W\. Wang, X\. Li, Y\. Mai, Y\. Zhang, and Y\. Koreeda \(2023\)Holistic evaluation of language models\.Transactions on Machine Learning Research\.Cited by:[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- R\. Y\. Lin, S\. Ojha, K\. Cai, and M\. F\. Chen \(2025\)Strategic collusion of LLM agents: market division in multi\-commodity competitions\.arXiv:2410\.00031\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px5.p1.1),[§6](https://arxiv.org/html/2606.03237#S6.p8.1)\.
- M\. L\. Littman \(1994\)Markov games as a framework for multi\-agent reinforcement learning\.InProceedings of the 11th International Conference on Machine Learning,pp\. 157–163\.Cited by:[§A\.1](https://arxiv.org/html/2606.03237#A1.SS1.p2.2),[§3\.2](https://arxiv.org/html/2606.03237#S3.SS2.p2.2)\.
- R\. D\. Luce and H\. Raiffa \(1957\)Games and decisions: introduction and critical survey\.Courier Corporation\.Cited by:[§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p2.1)\.
- I\. Makarov and A\. Schoar \(2020\)Trading and arbitrage in cryptocurrency markets\.Journal of Financial Economics135\(2\),pp\. 293–319\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px4.p1.1)\.
- M\. M\. H\. Manik and G\. Wang \(2026\)OpenClaw agents on Moltbook: risky instruction sharing and norm enforcement in an agent\-only social network\.arXiv:2602\.02625\.Cited by:[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- G\. Marwell and P\. Oliver \(1993\)The critical mass in collective action\.Cambridge University Press\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1)\.
- E\. Maskin \(2008\)Mechanism design: how to implement social goals\.American Economic Review98\(3\),pp\. 567–576\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p2.1),[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p1.1)\.
- R\. D\. McKelvey and T\. R\. Palfrey \(1995\)Quantal response equilibria for normal form games\.Games and Economic Behavior10\(1\),pp\. 6–38\.Cited by:[item 2](https://arxiv.org/html/2606.03237#A1.I2.i2.p1.1)\.
- S\. Milli, M\. Carroll, Y\. Wang, S\. Pandey, S\. Zhao, and A\. D\. Dragan \(2025\)Engagement, user satisfaction, and the amplification of divisive content on social media\.PNAS Nexus4\(3\),pp\. pgaf062\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1),[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- H\. Morrin, L\. Nicholls, Q\. Deeley, and T\. Pollak \(2026\)Playing with the dials of belief: how controllable AI behaviours could modulate human belief and cognition across scales\.PsyArXiv\.External Links:[Document](https://dx.doi.org/10.31234/osf.io/7qcv8%5Fv3)Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p4.1)\.
- C\. Mouffe \(1999\)Deliberative democracy or agonistic pluralism?\.Social Research,pp\. 745–758\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- R\. B\. Myerson \(2008\)Perspectives on mechanism design in economic theory\.American Economic Review98\(3\),pp\. 586–603\.Cited by:[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p2.1),[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p1.1)\.
- J\. H\. Nachbar \(1997\)Prediction, optimization, and learning in repeated games\.Econometrica65\(2\),pp\. 275–309\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px2.p3.2)\.
- R\. Nader \(1965\)Unsafe at any speed: the designed\-in dangers of the American automobile\.Grossman Publishers\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px5.p3.1)\.
- A\. Narayanan and S\. Kapoor \(2025\)AI as normal technology\.Knight First Amendment Institute25\.Cited by:[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p2.1)\.
- D\. Nekipelov, V\. Syrgkanis, and E\. Tardos \(2015\)Econometrics for learning agents\.Proceedings of the ACM Conference on Economics and Computation,pp\. 1–18\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px1.p1.1)\.
- R\. Ngo, L\. Chan, and S\. Mindermann \(2024\)The alignment problem from a deep learning perspective\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1),[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1)\.
- H\. Nissenbaum \(2004\)Privacy as contextual integrity\.Washington Law Review79\(1\),pp\. 119–158\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p1.1)\.
- D\. C\. North \(1990\)Institutions, institutional change and economic performance\.Cambridge University Press\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px5.p4.1)\.
- M\. A\. Nowak \(2006\)Five rules for the evolution of cooperation\.Science314\(5805\),pp\. 1560–1563\.External Links:[Document](https://dx.doi.org/10.1126/science.1133755)Cited by:[§6](https://arxiv.org/html/2606.03237#S6.p7.1),[§6](https://arxiv.org/html/2606.03237#S6.p8.1)\.
- S\. M\. Omohundro \(2008\)The basic AI drives\.InArtificial General Intelligence,External Links:[Document](https://dx.doi.org/10.5555/1566174.1566226)Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p9.1)\.
- E\. Ostrom \(1990\)Governing the commons: the evolution of institutions for collective action\.Cambridge University Press\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px5.p4.1),[§1](https://arxiv.org/html/2606.03237#S1.p1.1),[§1](https://arxiv.org/html/2606.03237#S1.p4.1),[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p2.1),[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p1.1),[§6](https://arxiv.org/html/2606.03237#S6.p3.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Gray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p4.1)\.
- S\. E\. Page \(2010\)Diversity and complexity\.Princeton University Press\.Cited by:[§6](https://arxiv.org/html/2606.03237#S6.p5.1)\.
- P\. Palensky and D\. Dietrich \(2011\)Demand side management: demand response, intelligent energy systems, and smart loads\.IEEE Transactions on Industrial Informatics7\(3\),pp\. 381–388\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px2.p1.1)\.
- K\. Palla, J\. L\. Redondo García, C\. Hauff, F\. Fabbri, H\. Lindström, D\. R\. Taber, A\. Damianou, and M\. Lalmas \(2025\)Policy\-as\-prompt: rethinking content moderation in the age of large language models\.InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency \(FAccT\),Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px3.p1.1)\.
- R\. Parasuraman and V\. Riley \(1997\)Humans and automation: use, misuse, disuse, abuse\.Human Factors39\(2\),pp\. 230–253\.Cited by:[§3\.3](https://arxiv.org/html/2606.03237#S3.SS3.p2.1),[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
- D\. C\. Parkes and M\. P\. Wellman \(2015\)Economic reasoning and artificial intelligence\.Science349\(6245\),pp\. 267–272\.External Links:[Document](https://dx.doi.org/10.1126/science.aaa8403)Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px5.p4.1),[§1](https://arxiv.org/html/2606.03237#S1.p3.1),[§6](https://arxiv.org/html/2606.03237#S6.p4.1)\.
- F\. Partnoy \(1997\)Financial derivatives and the costs of regulatory arbitrage\.Journal of Corporation Law22,pp\. 211–256\.Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px5.p2.1)\.
- F\. Pasquale \(2015\)The black box society: the secret algorithms that control money and information\.Harvard University Press,Cambridge, MA\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p4.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p1.1)\.
- D\. Pecorari \(2013\)Teaching to avoid plagiarism: how to promote good source use\.Open University Press\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px2.p1.1)\.
- J\. C\. Perdomo, T\. Zrnic, C\. Mendler\-Dünner, and M\. Hardt \(2020\)Performative prediction\.InProceedings of the 37th International Conference on Machine Learning \(ICML\),Cited by:[§A\.2](https://arxiv.org/html/2606.03237#A1.SS2.SSS0.Px1.p1.2),[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.5.4.4.1.1),[§1](https://arxiv.org/html/2606.03237#S1.p4.1),[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1),[§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p3.1),[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- J\. Perolat, J\. Z\. Leibo, V\. Zambaldi, C\. Beattie, K\. Tuyls, and T\. Graepel \(2017\)A multi\-agent reinforcement learning model of common\-pool resource appropriation\.Advances in Neural Information Processing Systems30\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1)\.
- M\. Phuong, M\. Aitchison, E\. Catt, S\. Cogan, A\. Kaskasoli, V\. Krakovna, D\. Lindner, M\. Rahtz, Y\. Assael, S\. Hodkinson, H\. Howard, T\. Lieberum, R\. Kumar, M\. A\. Raad, A\. Webson, L\. Ho, S\. Lin, S\. Farquhar, M\. Hutter, G\. Deletang, A\. Ruoss, S\. El\-Sayed, S\. Brown, A\. Dragan, R\. Shah, A\. Dafoe, and T\. Shevlane \(2024\)Evaluating frontier models for dangerous capabilities\.arXiv:2403\.13793\.Cited by:[Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.4.3.4.1.1),[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- G\. Piatti, Z\. Jin, M\. Kleiman\-Weiner, B\. Schölkopf, M\. Sachan, and R\. Mihalcea \(2024\)Cooperate or collapse: emergence of sustainable cooperation in a society of LLM agents\.Advances in Neural Information Processing Systems37,pp\. 111715–111759\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p2.1)\.
- T\. A\. Qiu, Z\. He, T\. Chugh, and M\. Kleiman\-Weiner \(2025\)The lock\-in hypothesis: stagnation by algorithm\.InProceedings of the 42nd International Conference on Machine Learning,Cited by:[§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p3.1)\.
- S\. D\. Ramchurn, P\. Vytelingum, A\. Rogers, and N\. R\. Jennings \(2012\)Putting the ‘smarts’ into the smart grid: a grand challenge for artificial intelligence\.Communications of the ACM55\(4\),pp\. 86–97\.Cited by:[§B\.3](https://arxiv.org/html/2606.03237#A2.SS3.SSS0.Px2.p1.1)\.
- J\. Rawls \(1993\)Political liberalism\.Columbia University Press,New York\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p1.1)\.
- I\. Reimers and B\. R\. Shiller \(2019\)The impacts of telematics on competition and consumer behavior in insurance\.Journal of Law and Economics\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px1.p1.1)\.
- M\. H\. Ribeiro, R\. Ottoni, R\. West, V\. A\. Almeida, and W\. Meira Jr \(2020\)Auditing radicalization pathways on YouTube\.InProceedings of the 2020 Conference on Fairness, Accountability, and Transparency,pp\. 131–141\.Cited by:[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p4.1)\.
- P\. Samuelson \(2023\)Generative AI meets copyright\.Science381\(6654\),pp\. 158–161\.Cited by:[§B\.2](https://arxiv.org/html/2606.03237#A2.SS2.SSS0.Px5.p1.1)\.
- F\. Santoni de Sio and J\. van den Hoven \(2018\)Meaningful human control over autonomous systems: a philosophical account\.Frontiers in Robotics and AI5,pp\. 15\.Cited by:[§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p4.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p2.1),[§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
- T\. C\. Schelling \(1960\)The strategy of conflict\.Harvard University Press\.Cited by:[§1](https://arxiv.org/html/2606.03237#S1.p3.1),[§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p2.1)\.
- P\. Shahidi, G\. Rusak, B\. S\. Manning, A\. Fradkin, and J\. J\. Horton \(2025\)The coasean singularity? demand, supply, and market design with AI agents\.NBER Working Paper 34468,National Bureau of Economic Research\.External Links:[Document](https://dx.doi.org/10.3386/w34468)Cited by:[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag, T\. Murray, S\. Min, P\. Dasigi, L\. Soldaini, F\. Brahman, W\. Yih, T\. Wu, L\. Zettlemoyer, Y\. Kim, H\. Hajishirzi, and P\. W\. Koh \(2026\)DR Tulu: reinforcement learning with evolving rubrics for deep research\.arXiv:2511\.19399\.Cited by:[§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- N\. Shapira, C\. Wendler, A\. Yen, G\. Sarti, K\. Pal, O\. Floody, A\. Belfki, A\. Loftus, A\. R\. Jannali, N\. Prakash, J\. Cui, G\. Rogers, J\. Brinkmann, C\. Rager, A\. Zur, M\. Ripa, A\. Sankaranarayanan, D\. Atkinson, R\. Gandikota, J\. Fiotto\-Kaufman, E\. Hwang, H\. Orgad, P\. S\. Sahil, N\. Taglicht, T\. Shabtay, A\. Ambus, N\. Alon, S\. Oron, A\. Gordon\-Tapiero, Y\. Kaplan, V\. Shwartz, T\. R\. Shaham, C\. Riedl, R\. Mirsky, M\. Sap, D\. Manheim, T\. Ullman, and D\. Bau \(2026\)Agents of chaos\.arXiv:2602\.20021\.Cited by:[§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- L\. S\. Shapley \(1953\)Stochastic games\.Proceedings of the National Academy of Sciences39\(10\),pp\. 1095–1100\.Cited by:[§A\.1](https://arxiv.org/html/2606.03237#A1.SS1.p2.2),[§3\.2](https://arxiv.org/html/2606.03237#S3.SS2.p2.2)\.
- A\. Sheshadri, J\. Hughes, J\. Michael, A\. T\. Mallen, A\. Jose, and F\. Roger \(2026\)Why do some language models fake alignment while others don’t?\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p1.1),[Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p2.1),[§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p5.1),[§6](https://arxiv.org/html/2606.03237#S6.p10.1)\.
- T\. Shevlane, S\. Farquhar, B\. Garfinkel, M\. Phuong, J\. Whittlestone, J\. Leung, D\. Kokotajlo, N\. Marchal, M\. Anderljung, N\. Kolt, L\. Ho, D\. Siddarth, S\. Avin, W\. Hawkins, B\. Kim, I\. Gabriel, V\. Bolina, J\. Clark, Y\. Bengio, P\. Christiano, and A\. Dafoe \(2023\) Model evaluation for extreme risks\. arXiv:2305\.15324\. Cited by: [Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.4.3.4.1.1), [§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- A\. Siththaranjan, C\. Laidlaw, and D\. Hadfield\-Menell \(2024\) Distributional preference learning: understanding and accounting for hidden context in RLHF\. In The Twelfth International Conference on Learning Representations\. Cited by: [§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- H\. Sondak and T\. R\. Tyler \(2007\) How does procedural justice shape the desirability of markets?\. Journal of Economic Psychology\. Cited by: [§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p4.1)\.
- T\. Sorensen, J\. Moore, J\. Fisher, M\. Gordon, N\. Mireshghallah, C\. M\. Rytting, A\. Ye, L\. Jiang, X\. Lu, N\. Dziri, T\. Althoff, and Y\. Choi \(2024\) Position: a roadmap to pluralistic alignment\. In Proceedings of the 41st International Conference on Machine Learning\. Cited by: [§2\.1](https://arxiv.org/html/2606.03237#S2.SS1.p1.1), [§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- G\. Soros \(1987\) The alchemy of finance\. Simon and Schuster\. Cited by: [Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px2.p2.1)\.
- G\. Soros \(2013\) Fallibility, reflexivity, and the human uncertainty principle\. Journal of Economic Methodology 20\(4\), pp\. 309–329\. Cited by: [§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p3.1)\.
- A\. Srivastava, A\. Rastogi, A\. Rao, A\. A\. M\. Shoeb, A\. Abid, A\. Fisch, A\. R\. Brown, A\. Santoro, A\. Gupta, A\. Garriga\-Alonso, A\. Kluska, A\. Lewkowycz, A\. Agarwal, A\. Power, A\. Ray, et al\. \(2023\) Beyond the imitation game: quantifying and extrapolating the capabilities of language models\. Transactions on Machine Learning Research\. External Links: ISSN 2835\-8856\. Cited by: [§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- K\. O\. Stanley \(2019\) Why open\-endedness matters\. Artificial Life 25\(3\), pp\. 232–235\. Cited by: [§4\.1](https://arxiv.org/html/2606.03237#S4.SS1.p2.1)\.
- R\. Sugden \(1986\) The economics of rights, co\-operation and welfare\. Springer\. Cited by: [§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p2.1)\.
- J\. Tainter \(1988\) The collapse of complex societies\. Cambridge University Press\. Cited by: [§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p1.1)\.
- N\. Tomašev, M\. Franklin, J\. Jacobs, S\. Krier, and S\. Osindero \(2026\) Distributional AGI safety\. arXiv:2512\.16856\. Cited by: [§1](https://arxiv.org/html/2606.03237#S1.p8.1)\.
- N\. Tomašev, M\. Franklin, J\. Z\. Leibo, J\. Jacobs, W\. A\. Cunningham, I\. Gabriel, and S\. Osindero \(2025\) Virtual agent economies\. External Links: 2509\.10147\. Cited by: [Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.2.1.4.1.1), [§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1), [§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- F\. Tonini and L\. Galke \(2025\) Super\-additive cooperation in language model agents\. arXiv:2508\.15510\. Cited by: [§6](https://arxiv.org/html/2606.03237#S6.p8.1)\.
- R\. Trivedi, K\. Sharma, and D\. C\. Parkes \(2025\) Inner speech as behavior guides: steerable imitation of diverse behaviors for human\-AI coordination\. In Advances in Neural Information Processing Systems\. Cited by: [§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
- A\. S\. Vezhnevets, J\. P\. Agapiou, A\. Aharon, R\. Ziv, J\. Matyas, E\. A\. Duéñez\-Guzmán, W\. A\. Cunningham, S\. Osindero, D\. Karmon, and J\. Z\. Leibo \(2023\) Generative agent\-based modeling with actions grounded in physical, social, or digital space using Concordia\. arXiv:2312\.03664\. Cited by: [Table 1](https://arxiv.org/html/2606.03237#A3.T1.2.2.1.4.1.1), [§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- L\. Weidinger, M\. Rauh, N\. Marchal, A\. Manzini, L\. A\. Hendricks, J\. Mateos\-Garcia, S\. Bergman, J\. Kay, C\. Griffin, B\. Bariach, I\. Gabriel, V\. Rieser, and W\. Isaac \(2023\) Sociotechnical safety evaluation of generative AI systems\. arXiv:2310\.11986\. Cited by: [§5\.1](https://arxiv.org/html/2606.03237#S5.SS1.p3.1)\.
- H\. Yakura, E\. Lopez\-Lopez, L\. Brinkmann, I\. Serna, P\. Gupta, I\. Soraperra, and I\. Rahwan \(2025\) Empirical evidence of large language model’s influence on human spoken communication\. arXiv:2409\.01754\. Cited by: [Appendix D](https://arxiv.org/html/2606.03237#A4.SS0.SSS0.Px4.p1.1), [§4\.2](https://arxiv.org/html/2606.03237#S4.SS2.p3.1)\.
- J\. Yang, E\. Wang, R\. Trivedi, T\. Zhao, and H\. Zha \(2022\) Adaptive incentive design with multi\-agent meta\-gradient reinforcement learning\. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems\. Cited by: [§5\.2](https://arxiv.org/html/2606.03237#S5.SS2.p2.1)\.
- H\. P\. Young \(1993\) The evolution of conventions\. Econometrica 61\(1\), pp\. 57–84\. Cited by: [Definition A\.7](https://arxiv.org/html/2606.03237#A1.Thmtheorem7.p1.3)\.
- H\. P\. Young \(2015\) The evolution of social norms\. Annual Review of Economics 7\(1\), pp\. 359–387\. Cited by: [§2\.2](https://arxiv.org/html/2606.03237#S2.SS2.p3.1), [§3\.4](https://arxiv.org/html/2606.03237#S3.SS4.p2.1)\.
- Y\. Zhuang, Q\. Liu, Z\. A\. Pardos, P\. C\. Kyllonen, J\. Zu, Z\. Huang, S\. Wang, and E\. Chen \(2025\) Position: AI evaluation should learn from how we test humans\. In Proceedings of the 42nd International Conference on Machine Learning, Vol\. 267\. Cited by: [§5\.3](https://arxiv.org/html/2606.03237#S5.SS3.p3.1)\.
## Appendix
## Appendix A Detailed Formalism
This section provides formal foundations for the concepts introduced in Section [3](https://arxiv.org/html/2606.03237#S3)\. We begin with standard formulations, extend to multi\-actor settings, and characterize the conditions under which endogenous non\-stationarity and equilibrium selection risk arise\.
### A\.1 From MDPs to Markov Games
A Markov Decision Process \(MDP\) is a tuple $(\mathcal{S},\mathcal{A},P,R,\gamma)$ where $\mathcal{S}$ is a state space, $\mathcal{A}$ is an action space, $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ specifies transition dynamics, $R:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is a reward function, and $\gamma\in[0,1)$ is a discount factor\. An actor selects a policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$ to maximize expected cumulative reward:
$$J(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})\right] \tag{1}$$
The critical assumption is that $P$ and $R$ are exogenous, in the sense that they do not depend on the actor’s policy $\pi$\. This assumption underwrites the convergence guarantees of reinforcement learning algorithms and the validity of offline evaluation on historical data\.
A Markov Game, also called a stochastic game, generalizes the MDP to a setting with $n$ actors \(Shapley, [1953](https://arxiv.org/html/2606.03237#bib.bib73); Littman, [1994](https://arxiv.org/html/2606.03237#bib.bib74)\)\. Formally, a Markov game is a tuple $(\mathcal{N},\mathcal{S},\{\mathcal{A}_{i}\}_{i\in\mathcal{N}},P,\{R_{i}\}_{i\in\mathcal{N}},\gamma)$ where:
- $\mathcal{N}=\{1,\ldots,n\}$ is the set of actors
- $\mathcal{S}$ is the state space
- $\mathcal{A}_{i}$ is the action space for actor $i$, with joint action space $\mathcal{A}=\prod_{i}\mathcal{A}_{i}$
- $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ specifies transition dynamics
- $R_{i}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function for actor $i$
- $\gamma\in[0,1)$ is a common discount factor
Each actor $i$ selects a policy $\pi_{i}:\mathcal{S}\to\Delta(\mathcal{A}_{i})$\. The joint policy is $\pi=(\pi_{1},\ldots,\pi_{n})$, and actor $i$’s expected return depends on all actors’ policies:
$$J_{i}(\pi)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{i}(s_{t},a_{t})\right] \tag{2}$$
In an MDP, the optimal policy $\pi^{*}$ is well\-defined as the policy that maximizes $J(\pi)$ against fixed dynamics\. In a Markov game, no single optimal policy exists, since each actor’s best response depends on others’ policies\. The result is interdependent optimization problems that admit equilibrium concepts rather than optima\.
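As a minimal sketch of why a Markov game admits no single optimal policy, the snippet below evaluates actor 1's discounted return in a single-state, two-actor game. The stag-hunt payoff numbers, the restriction to stationary policies, and all function names are assumptions introduced for illustration; they are not taken from the paper.

```python
# Hypothetical stag-hunt rewards for actor 1 (row) against actor 2 (column).
# Actions: 0 = Stag, 1 = Hare. The numbers are illustrative only.
R1 = [[4.0, 0.0],
      [3.0, 3.0]]
GAMMA = 0.9  # common discount factor


def expected_return(action, opponent_stag_prob, gamma=GAMMA):
    """Discounted return J_1 when actor 1 repeats `action` forever and
    actor 2 plays Stag with probability `opponent_stag_prob` at every step."""
    per_step = (opponent_stag_prob * R1[action][0]
                + (1.0 - opponent_stag_prob) * R1[action][1])
    # Geometric series: sum_t gamma^t * per_step
    return per_step / (1.0 - gamma)


def best_response(opponent_stag_prob):
    """Action maximizing actor 1's return against a fixed opponent policy."""
    returns = [expected_return(a, opponent_stag_prob) for a in (0, 1)]
    return max((0, 1), key=lambda a: returns[a])


br_vs_cooperator = best_response(0.9)  # opponent mostly hunts Stag
br_vs_defector = best_response(0.1)    # opponent mostly hunts Hare
```

Under these assumed payoffs the maximizing action flips with the opponent's policy, so "the optimal policy" is only defined relative to what the other actor does.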
### A\.2 Endogenous Non\-Stationarity
The solipsistic approach treats deployment as if the actor faces an MDP: dynamics $P(s^{\prime}|s,a)$ are assumed fixed and exogenous\. But when capable systems deploy among adaptive actors, this assumption fails\. Other actors, including humans, institutions, and algorithms, observe the deployed policy and adapt their behavior accordingly\. The dynamics become policy\-dependent\.
###### Definition A\.1 \(Endogenous Non\-Stationarity\)\.
A learning problem exhibits endogenous non\-stationarity if the deployment of policy $\pi$ induces a shift in the transition dynamics or reward function:
$$P_{\pi}(s^{\prime}|s,a)\neq P(s^{\prime}|s,a)\quad\text{or}\quad R_{\pi}(s,a)\neq R(s,a) \tag{3}$$
where the subscript $\pi$ indicates dependence on the deployed policy, mediated through best\-response adaptations by other actors\.
This is distinct from exogenous non\-stationarity such as seasonal variation or concept drift from external causes\. Endogenous non\-stationarity is caused by the policy itself, through strategic responses it provokes\. We note that $R$ here may be a proxy for the designer’s underlying objective, in which case endogenous non\-stationarity also captures proxy decoupling under deployment, the mechanism behind the most consequential Goodhart effects discussed in Section [4](https://arxiv.org/html/2606.03237#S4)\.
###### Definition A\.2 \(Train\-Test\-Deploy Gap\)\.
The train\-test\-deploy gap is the divergence between performance evaluated on historical \(exogenous\) data and performance under deployment \(endogenous\) conditions:
$$\text{Gap}(\pi)=J_{\text{train}}(\pi)-J_{\text{deploy}}(\pi) \tag{4}$$
where $J_{\text{train}}(\pi)=\mathbb{E}_{P}[\sum_{t}\gamma^{t}R(s_{t},a_{t})]$ is evaluated under the historical distribution $P$, and $J_{\text{deploy}}(\pi)=\mathbb{E}_{P_{\pi}}[\sum_{t}\gamma^{t}R(s_{t},a_{t})]$ is evaluated under the policy\-induced distribution $P_{\pi}$\.
Standard generalization bounds in supervised learning and regret bounds in online learning assume that training and deployment distributions are identical or that distribution shift is bounded\. Endogenous non\-stationarity violates these assumptions because the deployment distribution is a function of the policy, and stronger policies may induce larger shifts\.
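The gap in Definition A.2 can be made concrete with a toy single-state example. In the sketch below, a pricing policy is summarized by a markup in [0, 1]. The purchase rate, the linear buyer response, and every name in the code are hypothetical assumptions chosen so that the arithmetic is easy to follow.

```python
GAMMA = 0.9
BASE_BUY_PROB = 0.8  # hypothetical purchase rate observed in historical logs


def discounted(per_step_reward, gamma=GAMMA):
    """Value of receiving the same expected reward at every step."""
    return per_step_reward / (1.0 - gamma)


def j_train(markup):
    # Historical distribution P: logged buyers did not react to the markup.
    return discounted(markup * BASE_BUY_PROB)


def j_deploy(markup):
    # Policy-induced distribution P_pi (assumed): buyers leave as the markup rises.
    return discounted(markup * BASE_BUY_PROB * (1.0 - markup))


def gap(markup):
    """Gap(pi) = J_train(pi) - J_deploy(pi)."""
    return j_train(markup) - j_deploy(markup)


gaps = [gap(m) for m in (0.0, 0.25, 0.5, 0.75, 1.0)]
```

In this toy model the gap equals 0.8 times the squared markup, divided by (1 - γ), so it widens as the policy exploits the historical regularity more aggressively.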
#### Connection to Performative Prediction\.
The performative prediction framework \(Perdomo et al\., [2020](https://arxiv.org/html/2606.03237#bib.bib35)\) formalizes a related phenomenon: a predictive model $f$ deployed on a population induces a distribution $\mathcal{D}(f)$ that depends on the model itself\. The performative risk is:
$$\text{PR}(f)=\mathbb{E}_{z\sim\mathcal{D}(f)}[\ell(f,z)] \tag{5}$$
where $\ell$ is a loss function\. Minimizing performative risk is harder than minimizing standard statistical risk because the distribution being predicted responds to the predictor\. Our setting extends performative prediction in two ways\. First, we consider sequential decision\-making rather than one\-shot prediction, introducing temporal dynamics and path\-dependence\. Second, we explicitly model multiple strategic actors rather than a single responsive distribution, allowing for game\-theoretic equilibrium analysis\.
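A small numerical sketch may help fix ideas about performative risk. It assumes a scalar prediction, squared loss, and a population whose outcome mean shifts linearly with the deployed prediction. The constants and function names are illustrative assumptions, and the retraining loop is a naive one written for this example, not a method from the paper.

```python
BASE_MEAN = 1.0    # hypothetical outcome mean before any model is deployed
SENSITIVITY = 0.5  # hypothetical strength of the population's response to f


def induced_mean(f):
    """Mean of z under D(f): the population shifts in proportion to f."""
    return BASE_MEAN + SENSITIVITY * f


def performative_risk(f, noise_var=0.0):
    """PR(f) for squared loss (f - z)^2 when z ~ D(f) has the mean above."""
    return (f - induced_mean(f)) ** 2 + noise_var


def retrain(f):
    """Squared-loss minimizer on data drawn from D(f): the induced mean."""
    return induced_mean(f)


f = BASE_MEAN  # model fit to pre-deployment data
history = [f]
for _ in range(30):
    f = retrain(f)
    history.append(f)

static_risk_estimate = (history[0] - BASE_MEAN) ** 2  # risk on historical data
risk_at_first_deploy = performative_risk(history[0])  # risk once the population reacts
stable_point = BASE_MEAN / (1.0 - SENSITIVITY)        # fixed point of retraining
```

The first model looks perfect on historical data yet incurs positive risk once deployed. Repeated retraining converges here only because the assumed sensitivity is below one.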
### A\.3 The Self\-Undermining Property
A counterintuitive feature of optimization in strategic environments is that more aggressive exploitation of historical regularities can accelerate their obsolescence\.
###### Definition A\.3 \(Self\-Undermining Property\)\.
Let $\pi_{\theta}$ be a parameterized policy\. The policy family exhibits the self\-undermining property at $\theta$ if moving in the direction of steepest ascent in training performance decreases deployment performance:
$$\nabla_{\theta}J_{\text{train}}(\pi_{\theta})\cdot\nabla_{\theta}J_{\text{deploy}}(\pi_{\theta})<0. \tag{6}$$
This occurs when the policy exploits patterns that depend on other actors’ current strategies\. As the policy extracts more value from these patterns, it strengthens incentives for other actors to adapt by changing strategies, seeking alternatives, or exiting the interaction entirely\. The very success of exploitation hastens its own obsolescence\.
###### Proposition A\.4 \(Sufficient Conditions for Self\-Undermining\)\.
Let $\pi_{\theta}$ be a parameterized policy and suppose:
1. Other actors best\-respond to $\pi_{\theta}$ with policies $\pi_{-i}^{BR}(\theta)$\.
2. The mapping $\theta\mapsto\pi_{-i}^{BR}(\theta)$ is differentiable, as it is under standard smoothing assumptions on best responses such as logit\-quantal response \(McKelvey and Palfrey, [1995](https://arxiv.org/html/2606.03237#bib.bib192)\)\.
3. Historical regularities exploited by $\pi_{\theta}$ depend on $\pi_{-i}$ remaining fixed, so that $\frac{\partial J_{\text{deploy}}}{\partial\theta}\big|_{\pi_{-i}}>0$ when training performance improves\.
4. Adaptations by other actors harm the deploying actor’s deployment performance: $\frac{\partial J_{\text{deploy}}}{\partial\pi_{j}}<0$ for $j\neq i$ in the relevant region of parameter space\.
Then for sufficiently aggressive exploitation, captured by $\big\|\frac{d\pi_{-i}^{BR}}{d\theta}\big\|$ being sufficiently large, the policy family exhibits the self\-undermining property at $\theta$\.
###### Proof Sketch\.
By the chain rule:
$$\frac{dJ_{\text{deploy}}}{d\theta}=\frac{\partial J_{\text{deploy}}}{\partial\theta}\bigg|_{\pi_{-i}}+\sum_{j\neq i}\frac{\partial J_{\text{deploy}}}{\partial\pi_{j}}\cdot\frac{d\pi_{j}^{BR}}{d\theta}. \tag{7}$$
The first term is the direct effect of policy improvement on deployment performance, holding others’ policies fixed; assumption \(3\) makes this term positive when training performance improves\. The second term captures the indirect effect through induced adaptation\. By assumption \(4\), each $\frac{\partial J_{\text{deploy}}}{\partial\pi_{j}}$ is negative, and by assumption \(2\), the response derivatives $\frac{d\pi_{j}^{BR}}{d\theta}$ are well\-defined\. When the response derivatives are sufficiently large in magnitude, the indirect term dominates the direct term, making $\frac{dJ_{\text{deploy}}}{d\theta}$ negative even as $\frac{dJ_{\text{train}}}{d\theta}$ remains positive\.
Taking the update direction to be the training gradient $\nabla_{\theta}J_{\text{train}}$ and moving $\theta$ along it,
$$\frac{dJ_{\text{train}}}{d\theta}=\lVert\nabla_{\theta}J_{\text{train}}\rVert^{2}>0,\qquad\frac{dJ_{\text{deploy}}}{d\theta}=\nabla_{\theta}J_{\text{deploy}}\cdot\nabla_{\theta}J_{\text{train}}<0,$$
where the second quantity is the chain\-rule expression above, evaluated along the update direction\. Hence $\nabla_{\theta}J_{\text{train}}\cdot\nabla_{\theta}J_{\text{deploy}}<0$ and the policy family satisfies Definition [A\.3](https://arxiv.org/html/2606.03237#A1.Thmtheorem3)\.
∎
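The chain-rule argument can be checked numerically on a scalar toy problem. The sketch below assumes a single exploitation parameter θ, a logistic response by counterparties that stands in for a smoothed best response, and linear training value. The functional forms, the response-sharpness parameter `k`, and all names are hypothetical choices made for this illustration.

```python
import math


def avoidance(theta, k):
    """Assumed logistic response: share of counterparties that adapt away."""
    return 1.0 / (1.0 + math.exp(-k * (theta - 1.0)))


def j_train(theta):
    # Historical data: nobody has adapted yet, so value grows linearly in theta.
    return theta


def j_deploy(theta, k):
    # Deployment: a share avoidance(theta, k) of counterparties has adapted away.
    return theta * (1.0 - avoidance(theta, k))


def deploy_gradient(theta, k):
    """Direct effect plus indirect effect through induced adaptation."""
    a = avoidance(theta, k)
    direct = 1.0 - a              # dJ_deploy/dtheta with others held fixed (> 0)
    harm = -theta                 # dJ_deploy/da (< 0 for theta > 0)
    response = k * a * (1.0 - a)  # da/dtheta for the logistic response
    return direct + harm * response


def self_undermining(theta, k):
    """True when the train and deploy gradients have opposite signs."""
    train_gradient = 1.0          # dJ_train/dtheta
    return train_gradient * deploy_gradient(theta, k) < 0.0
```

With a sharp response (`k = 5.0`) the indirect term dominates at `theta = 1.0`, while a sluggish response (`k = 0.5`) leaves the total derivative positive. This matches the role of the "sufficiently large" response derivative in the proposition.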
### A\.4 Equilibrium Concepts in Markov Games
In Markov games, the solution concept shifts from optimality to equilibrium\.
For each actor $i$ and joint policy $\pi$, define the state\-dependent value function
$$V_{i}^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{i}(s_{t},a_{t})\,\middle|\,s_{0}=s\right], \tag{8}$$
which gives actor $i$’s expected return starting from state $s$ when all actors follow $\pi$\.
###### Definition A\.5 \(Markov Perfect Equilibrium\)\.
A joint policy $\pi^{*}=(\pi_{1}^{*},\ldots,\pi_{n}^{*})$ is a Markov Perfect Equilibrium \(MPE\) if, for every actor $i$, every state $s\in\mathcal{S}$, and every alternative policy $\pi_{i}$ for actor $i$:
$$V_{i}^{(\pi_{i}^{*},\pi_{-i}^{*})}(s)\geq V_{i}^{(\pi_{i},\pi_{-i}^{*})}(s). \tag{9}$$
That is, from every state, no actor can unilaterally improve their expected return by deviating from $\pi_{i}^{*}$, given that others play $\pi_{-i}^{*}$\.
MPE existence is guaranteed for finite Markov games \(Fink, [1964](https://arxiv.org/html/2606.03237#bib.bib104)\), but uniqueness is not\. Markov games typically admit multiple equilibria, which may differ substantially in their payoffs to various actors and in aggregate welfare\.
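Multiplicity is easy to exhibit in the simplest case. The sketch below enumerates pure-strategy equilibria of a single-state, two-actor game. With one state and stationary policies, the unilateral-deviation check reduces to comparing stage rewards, since the discount factor scales all values equally. The payoff table is a hypothetical stag hunt, and the code ignores mixed equilibria.

```python
from itertools import product

# PAYOFFS[(a1, a2)] = (reward to actor 1, reward to actor 2); 0 = Stag, 1 = Hare.
# Hypothetical numbers chosen for illustration.
PAYOFFS = {
    (0, 0): (4.0, 4.0),
    (0, 1): (0.0, 3.0),
    (1, 0): (3.0, 0.0),
    (1, 1): (3.0, 3.0),
}
ACTIONS = (0, 1)


def is_equilibrium(profile):
    """True if no actor gains by unilaterally switching action."""
    for actor in (0, 1):
        current = PAYOFFS[profile][actor]
        for deviation in ACTIONS:
            alternative = list(profile)
            alternative[actor] = deviation
            if PAYOFFS[tuple(alternative)][actor] > current:
                return False
    return True


equilibria = [p for p in product(ACTIONS, repeat=2) if is_equilibrium(p)]
welfare = {p: sum(PAYOFFS[p]) for p in equilibria}
```

Both joint Stag and joint Hare survive the deviation check, but they differ in aggregate reward, which is the situation that makes selection among equilibria matter.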
### A\.5 Equilibrium Selection and Stability
The existence of multiple equilibria raises the question of selection: which equilibrium will the system reach? This is a practical concern, as different equilibria may correspond to vastly different social outcomes\.
###### Definition A\.6 \(Basin of Attraction\)\.
Let $\Phi$ be a dynamical system describing policy adaptation, such as gradient descent, replicator dynamics, or best\-response dynamics, with state $\pi_{t}$ updated according to $\pi_{t+1}=\Phi(\pi_{t})$\. The basin of attraction of equilibrium $\pi^{*}$ under $\Phi$ is the set of initial conditions from which the dynamics converge to $\pi^{*}$:
$$\mathcal{B}(\pi^{*})=\{\pi_{0}:\lim_{t\to\infty}\pi_{t}=\pi^{*}\}. \tag{10}$$
Deploying a powerful optimizer into a multi\-actor system is an intervention that can shift the system from one basin of attraction to another\. Threshold and tipping point models suggest that small differences can determine which equilibrium basin the coupled system settles into \(Centola et al\., [2018](https://arxiv.org/html/2606.03237#bib.bib59); Granovetter, [1978](https://arxiv.org/html/2606.03237#bib.bib60)\)\. Initial deployment details such as timing, scale, and interface design may shape this selection\.
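To illustrate how a small difference in initial conditions can decide the basin, the sketch below iterates a damped best-response map on the share of a population playing Stag. The payoffs, the step size, and the one-dimensional state are assumptions made for this example. Here Φ is the function `phi`.

```python
STAG_PAYOFF = 4.0  # hypothetical reward when both partners hunt Stag
HARE_PAYOFF = 3.0  # hypothetical reward for hunting Hare, regardless of partner
STEP = 0.2         # share of the population that revises each period


def phi(x):
    """One step of damped best-response dynamics on x, the share playing Stag."""
    stag_is_best = STAG_PAYOFF * x > HARE_PAYOFF  # expected Stag reward vs Hare
    target = 1.0 if stag_is_best else 0.0
    return x + STEP * (target - x)


def limit_point(x0, steps=200):
    """Iterate phi from x0 for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = phi(x)
    return x


def basin(x0, tol=1e-6):
    """Label the equilibrium that the dynamics approach from x0."""
    x = limit_point(x0)
    if x > 1.0 - tol:
        return "all-Stag"
    if x < tol:
        return "all-Hare"
    return "undetermined"
```

Starting shares of 0.74 and 0.76 sit on opposite sides of the tipping point at 0.75, so `basin(0.74)` and `basin(0.76)` return different equilibria even though the initial states are almost identical.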
###### Definition A\.7 \(Stochastic Stability\)\.
An equilibrium $\pi^{*}$ is stochastically stable if it remains in the support of the limiting distribution as noise vanishes:
$$\lim_{\epsilon\to 0}\mu_{\epsilon}(\pi^{*})>0, \tag{11}$$
where $\mu_{\epsilon}$ is the stationary distribution of the perturbed dynamics with noise level $\epsilon$ \(Young, [1993](https://arxiv.org/html/2606.03237#bib.bib102)\)\.
Stochastically stable equilibria are robust to small perturbations and are the most likely long\-run outcomes under noisy adaptation\. However, convergence to stochastically stable equilibria can be extremely slow, exponential in population size for some dynamics \(Ellison, [2000](https://arxiv.org/html/2606.03237#bib.bib103)\)\. In the interim, the system may persist at inefficient or harmful equilibria for extended periods\.
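The limit in Definition A.7 can be computed exactly for a small birth-death chain. The sketch below is a deliberately simplified perturbed best-response model. One randomly chosen agent revises per period, playing a best response with probability 1 - ε and a uniformly random action otherwise. The population size, the threshold, and the revision rule are assumptions made for this illustration, not the model of the cited work.

```python
N = 10         # hypothetical population size
THRESHOLD = 7  # Stag is a best response once at least this many agents play Stag


def stag_choice_prob(k, eps):
    """Probability that a revising agent picks Stag when k agents play Stag."""
    best_response_is_stag = 1.0 if k >= THRESHOLD else 0.0
    return (1.0 - eps) * best_response_is_stag + eps / 2.0


def stationary_distribution(eps):
    """Stationary distribution over k = number of Stag players.

    Uses detailed balance for a birth-death chain:
    mu(k) * p_up(k) = mu(k + 1) * p_down(k + 1).
    """
    weights = [1.0]
    for k in range(N):
        p_up = (N - k) / N * stag_choice_prob(k, eps)                # k -> k + 1
        p_down = (k + 1) / N * (1.0 - stag_choice_prob(k + 1, eps))  # k + 1 -> k
        weights.append(weights[-1] * p_up / p_down)
    total = sum(weights)
    return [w / total for w in weights]


mu_small_noise = stationary_distribution(0.001)
```

Under these assumptions, leaving all-Hare takes seven simultaneous mistakes while leaving all-Stag takes only four, so the mass on all-Stag vanishes as ε shrinks and all-Hare is the stochastically stable state. The long-run outcome is therefore the lower-payoff equilibrium in this toy model.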
###### Definition A\.8 \(Equilibrium Selection Risk\)\.
Let $\mathcal{E}=\{E_{1},\ldots,E_{k}\}$ be the set of equilibria in a Markov game, and let $W:\mathcal{E}\to\mathbb{R}$ be a social welfare functional that assigns each equilibrium a welfare level \(for instance, $W(E)=\sum_{i}V_{i}^{\pi^{E}}(s_{0})$ or $W(E)=\min_{i}V_{i}^{\pi^{E}}(s_{0})$\)\. Suppose a policy $\pi$ is deployed that shifts the system from basin $\mathcal{B}(E_{i})$ to basin $\mathcal{B}(E_{j})$\. The equilibrium selection risk of $\pi$ is:
$$\mathrm{ESR}(\pi)=W(E_{i})-W(E_{j}). \tag{12}$$
The risk is positive when deployment shifts the system toward a lower\-welfare equilibrium\.
Equilibrium selection risk is distinct from standard notions of AI risk focused on misalignment or capability\. A perfectly aligned system can nonetheless tip a sociotechnical system into an inferior equilibrium through the strategic responses its presence induces, even when no individual action it takes is misaligned\.
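As a worked instance of Definition A.8, the sketch below computes ESR under the two welfare functionals mentioned there, the sum and the minimum of the actors' values. The per-step rewards at each equilibrium, the single-state setting, and the function names are hypothetical.

```python
GAMMA = 0.9

# Hypothetical per-step rewards (actor 1, actor 2) at each equilibrium.
EQUILIBRIUM_REWARDS = {
    "all-Stag": (4.0, 4.0),
    "all-Hare": (3.0, 3.0),
}


def values(equilibrium):
    """Discounted value V_i(s_0) for each actor when the equilibrium is played forever."""
    return tuple(r / (1.0 - GAMMA) for r in EQUILIBRIUM_REWARDS[equilibrium])


def utilitarian_welfare(equilibrium):
    return sum(values(equilibrium))


def egalitarian_welfare(equilibrium):
    return min(values(equilibrium))


def equilibrium_selection_risk(before, after, welfare=utilitarian_welfare):
    """ESR = W(E_before) - W(E_after); positive when welfare falls."""
    return welfare(before) - welfare(after)


esr_sum = equilibrium_selection_risk("all-Stag", "all-Hare")
esr_min = equilibrium_selection_risk("all-Stag", "all-Hare", egalitarian_welfare)
```

A deployment that tips this toy system from all-Stag to all-Hare has positive ESR under both functionals, and the reverse shift has negative ESR of the same magnitude.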
#### Path Dependence and Lock\-In\.
Once a system reaches an equilibrium, escaping to a superior one may be costly or impossible\. Network effects, infrastructure dependencies, and behavioral habituation create lock\-in \(Arthur, [1989](https://arxiv.org/html/2606.03237#bib.bib75)\)\. This path dependence means that early deployment choices, made when consequences are least predictable, can have permanent effects on equilibrium selection\.
### A\.6 Implications for Evaluation and Design
The formal analysis yields several implications for AI evaluation and design:
1. Offline evaluation is insufficient\. Standard train\-test splits assume exogenous distributions\. Under endogenous non\-stationarity \(Definition [A\.1](https://arxiv.org/html/2606.03237#A1.Thmtheorem1)\), test performance does not predict deployment performance\. Evaluation must incorporate adaptive counterparties\.
2. Capability improvements may be counterproductive\. The self\-undermining property \(Definition [A\.3](https://arxiv.org/html/2606.03237#A1.Thmtheorem3)\) implies that stronger policies can yield worse deployment outcomes\. Optimization pressure on historical benchmarks may select for systems that destabilize upon deployment\.
3. Equilibrium welfare is the relevant objective\. A policy that is locally optimal can participate in globally suboptimal equilibria\. Design must consider what equilibria the policy makes reachable, rather than only what the policy does in isolation \(Definition [A\.8](https://arxiv.org/html/2606.03237#A1.Thmtheorem8)\)\.
4. Deployment is an intervention\. Introducing a powerful optimizer changes the game rather than playing within fixed rules\. Design and governance must account for the system’s effect on the strategic environment it enters\.
## Appendix B Empirical documentation of three channels
Section [3](https://arxiv.org/html/2606.03237#S3) identified three channels through which deployment induces structured adaptation: behavioral, institutional, and algorithmic\. This section provides empirical documentation for each channel, drawing on evidence from deployed systems across multiple domains\.
### B\.1 Behavioral Adaptation
Humans systematically alter their behavior in response to AI systems, often in ways that reshape the distribution the system encounters and erode the human capacities the system was designed to augment\. The examples below focus on a particularly well\-documented form of behavioral adaptation: the cognitive deskilling that follows from sustained reliance on automation\. In each case, the human capability the system depends on is partly co\-produced by the system’s own operation, and the resulting train\-test\-deploy gap \(Definition [A\.2](https://arxiv.org/html/2606.03237#A1.Thmtheorem2)\) widens precisely as the system becomes more useful\.
#### Spatial Cognition and GPS Dependence\.
Longitudinal studies document degradation of spatial navigation skills among habitual GPS users\. Dahmani and Bohbot \([2020](https://arxiv.org/html/2606.03237#bib.bib105)\) showed that GPS users demonstrated weaker cognitive map formation compared to those who navigated without assistance\. The effect compounds: as navigation skills atrophy, users become more dependent on GPS, further reducing opportunities for unassisted navigation\.
#### Spell\-Checkers and Orthographic Skill\.
The ubiquity of spell\-checking has measurably affected orthographic competence\. Galletta et al\. \([2005](https://arxiv.org/html/2606.03237#bib.bib107)\) found that spell\-checker availability reduced attention to spelling during composition, with users deferring to automated correction rather than developing or maintaining accurate spelling\. The pattern illustrates a general dynamic: when a system reliably catches errors, the cognitive investment in avoiding errors diminishes\.
#### Auto\-Complete and Writing Homogenization\.
Predictive text and auto\-complete systems shape the distribution of language they encounter by influencing what users write\. Arnold et al\. \([2020](https://arxiv.org/html/2606.03237#bib.bib110)\) found that auto\-complete suggestions biased writers toward suggested phrases, reducing lexical diversity and individual stylistic variation\. Buschek et al\. \([2021](https://arxiv.org/html/2606.03237#bib.bib111)\) similarly found that multiple phrase suggestions changed email composition behavior while imposing efficiency costs\. The effect is self\-reinforcing: as users adopt suggestions, the system trains on more homogeneous text, narrowing the space of future suggestions\.
#### Translation Tools and Language Learning\.
Machine translation availability has altered language learning behavior and outcomes\. Groves and Mundt \([2015](https://arxiv.org/html/2606.03237#bib.bib112)\) found that Google Translate produces academic text usable enough that students adopt it readily and instructors struggle to discourage it\. The concern is that leaning on the tool displaces the effortful processing that consolidates language learning\.
Across all four cases, the deployed system encounters a population whose capabilities have been reshaped by its own previous operation, which is the empirical signature of the self\-undermining property \(Definition [A\.3](https://arxiv.org/html/2606.03237#A1.Thmtheorem3)\) in the human channel\.
### B\.2 Institutional Adaptation
Organizations rewrite rules, modify procedures, and adjust policies in response to deployed AI systems, creating feedback loops between technical systems and institutional structures\. The examples below span pre\-foundation\-model and foundation\-model\-era cases and illustrate the same structural dynamic: institutions adapt strategically once a deployed system begins reshaping the environment they govern, which in turn changes the distribution the system encounters\.
#### Insurance Telematics and Risk Recalibration\.
The introduction of telematics\-based auto insurance, where premiums depend on monitored driving behavior, has reshaped both driver behavior and insurance practices\. Insurers deploy telematics systems that set premiums from monitored driving signals such as mileage, speed, acceleration, and braking \(Husnjak et al\., [2015](https://arxiv.org/html/2606.03237#bib.bib113)\)\. Because these scores determine prices, monitoring can change driver behavior, making insurance pricing a feedback system rather than a static classification regime \(Jin and Vasserman, [2021](https://arxiv.org/html/2606.03237#bib.bib215); Reimers and Shiller, [2019](https://arxiv.org/html/2606.03237#bib.bib216)\)\. The resulting equilibrium differs from both pre\-telematics risk pools and the idealized “safe driving incentive” that motivated adoption\.
#### Academic Publishing and Plagiarism Detection\.
Plagiarism detection systems have transformed academic writing practices and institutional policies\. Institutions responded with revised honor codes, mandatory submission to detection services, and educational interventions \(Pecorari, [2013](https://arxiv.org/html/2606.03237#bib.bib115)\)\. The emergence of AI writing tools has intensified this dynamic: detection systems now attempt to identify AI\-generated text, students explore methods to evade detection, and institutions scramble to adapt policies for a landscape that shifts faster than governance can track \(Cotton et al\., [2023](https://arxiv.org/html/2606.03237#bib.bib116)\)\.
#### Platform Content Moderation and Generative AI\.
Major online platforms have repeatedly revised their content policies as generative AI has reshaped what users post and how\. The first wave of LLM deployment produced floods of synthetic text, fake reviews, and AI\-generated images, prompting platforms to issue new disclosure requirements, label AI\-generated material, and update detection systems\. Each policy revision has in turn shaped how users deploy AI tools, with content optimized to bypass detection, prompts engineered to evade keyword filters, and AI\-generated material increasingly indistinguishable from human\-written content\. Detection systems have responded by incorporating LLMs themselves, using policy\-as\-prompt approaches that allow rapid policy iteration without retraining classifiers \(Palla et al\., [2025](https://arxiv.org/html/2606.03237#bib.bib193)\)\. The result is a moderation environment co\-evolving at the speed of model releases, where the rules under which content is judged, the content being judged, and the systems doing the judging are all changing in response to one another\.
#### Tax Authorities and Algorithmic Auditing\.
Tax agencies increasingly use algorithmic systems to identify returns for audit, prompting adaptation by both taxpayers and tax professionals\. Digitization shifts the balance between tax enforcement and evasion by strengthening information flows available to authorities while opening new avenues for some firms and high\-income individuals \(Alm, [2021](https://arxiv.org/html/2606.03237#bib.bib117)\)\. Tax authorities have responded by revising audit selection rules and refining their algorithms, producing an ongoing co\-evolution between enforcement and strategic filing that differs from both random sampling and idealized risk\-targeting\.
#### Copyright Systems and AI\-Generated Content\.
Copyright enforcement systems face fundamental adaptation challenges as AI\-generated content proliferates\. Platforms have revised content policies to address AI\-generated material, while content authentication initiatives attempt to distinguish human from machine creation\. Creators adapt by using AI tools in ways that evade detection or by incorporating AI outputs into human\-supervised workflows that complicate attribution\. The legal frameworks themselves are under revision, with courts and legislatures grappling with questions of authorship and ownership that existing doctrine did not anticipate \(Samuelson, [2023](https://arxiv.org/html/2606.03237#bib.bib118)\)\.
These institutional feedback loops are instances of the same endogenous non\-stationarity \(Definition [A\.1](https://arxiv.org/html/2606.03237#A1.Thmtheorem1)\) documented in Section [3](https://arxiv.org/html/2606.03237#S3), with the institutional channel typically operating on slower timescales than behavioral or algorithmic adaptation but with longer\-lasting effects on the rules under which deployment proceeds\.
### B\.3 Algorithmic Adaptation
When multiple AI systems share an environment, they adapt to each other, producing emergent dynamics that no single system’s designers intended\. The examples below illustrate how this co\-evolution unfolds across deployed market and infrastructure systems, providing the empirical backing for the algorithmic adaptation channel formalized in Section [3](https://arxiv.org/html/2606.03237#S3)\.
#### Advertising Auction Dynamics\.
Automated bidding systems in digital advertising create complex interaction dynamics\. Nekipelov et al\. \([2015](https://arxiv.org/html/2606.03237#bib.bib119)\) develop econometric methods to infer bidder values in sponsored search auctions under the assumption that bidders adapt through no\-regret learning, rather than assuming play is at a Nash equilibrium\. Banchio and Skrzypacz \([2022](https://arxiv.org/html/2606.03237#bib.bib120)\) showed that competing automated bidders converge to equilibria shaped by auction design, with first\-price formats more prone to collusive outcomes\. Advertising platforms have repeatedly adjusted auction mechanisms in response to algorithmic bidder behavior, triggering further adaptation by bidding systems\.
#### Electric Grid Demand Response\.
Smart grid systems that automate demand response create coordination challenges across many participants\. Ramchurn et al\. \([2012](https://arxiv.org/html/2606.03237#bib.bib124)\) analyzed scenarios where multiple automated systems responded to grid signals, finding that uncoordinated responses could produce oscillations and instabilities, with many devices responding to a price signal simultaneously and then withdrawing simultaneously when prices spike\. Palensky and Dietrich \([2011](https://arxiv.org/html/2606.03237#bib.bib125)\) documented the emergence of “rebound effects” where suppressed demand was shifted rather than reduced, with automated systems across the grid responding to the same signals in correlated ways that amplified rather than smoothed demand peaks\.
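The synchronization mechanism can be reproduced in a few lines. The sketch below is a toy model written for this illustration and is not the model analyzed in the cited studies. The price rule, the thresholds, and the function name are all assumptions. Each device runs whenever the current price is at or below its own threshold, and the price rises with the previous period's load.

```python
def simulate(thresholds, steps=20, base_price=1.0, slope=2.0):
    """Return the share of active devices in each period.

    Price rule (assumed): price = base_price + slope * previous load share.
    Device rule (assumed): run whenever price <= the device's own threshold.
    """
    n = len(thresholds)
    load = 0.0
    loads = []
    for _ in range(steps):
        price = base_price + slope * load
        active = sum(1 for threshold in thresholds if price <= threshold)
        load = active / n
        loads.append(load)
    return loads


# Identical controllers react to the same signal in lockstep.
synchronized = simulate([2.0] * 50)

# Controllers with thresholds spread over a wide range respond gradually.
staggered = simulate([4.0 * i / 100 for i in range(101)])
```

With identical thresholds the load flips between everyone and no one each period. With widely spread thresholds the same price rule settles near half the devices, which suggests that correlated decision rules, and not automation itself, drive the oscillation in this toy setting.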
#### Search Engine Optimization Arms Race\.
The interaction between search ranking algorithms and automated SEO tools constitutes a decades\-long co\-evolutionary process\. Gyöngyi and Garcia\-Molina \([2005](https://arxiv.org/html/2606.03237#bib.bib126)\) documented early web spam techniques that exploited PageRank, prompting algorithmic countermeasures that spammers subsequently adapted to\. Lewandowski \([2023](https://arxiv.org/html/2606.03237#bib.bib127)\) traced the ongoing arms race through successive generations of search algorithms and optimization strategies, noting that each ranking update triggers rapid adaptation by SEO tools that monitor and reverse\-engineer algorithmic changes\. The content that search users encounter is shaped by this adversarial dynamic\.
#### Cryptocurrency Trading Bot Interactions\.
Automated trading in cryptocurrency markets produces dynamics that differ from traditional financial markets\. \(Makarov and Schoar,[2020](https://arxiv.org/html/2606.03237#bib.bib128)\) documented large, recurrent price discrepancies across cryptocurrency exchanges, with arbitrage opportunities that often persisted for days rather than being competed away instantly\. \(Daian et al\.,[2020](https://arxiv.org/html/2606.03237#bib.bib129)\) analyzed “priority gas auctions” in decentralized finance, where bots compete to front\-run transactions by bidding on transaction ordering\. The absence of circuit breakers and regulatory oversight allows algorithmic interactions to proceed faster and further than in traditional markets, revealing dynamics that regulated markets may suppress but not eliminate\.
#### LLM Agents in Market Settings\.
Recent work has documented the same co\-evolutionary dynamics emerging when language model agents are deployed in market settings\. LLM\-based pricing agents converge on supracompetitive prices in oligopoly settings without explicit collusion instructions\(Fish et al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib188)\) and divide markets when deployed in multi\-commodity Cournot competitions\(Lin et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib189)\), while self\-play Q\-learners provably learn collusive policies in iterated social dilemmas\(Bertrand et al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib190)\)\. These results extend the algorithmic adaptation channel from earlier\-generation pricing and trading systems to foundation\-model agents, with the same structural pattern: optimization against a market environment populated by other learners produces equilibria that benefit the colluding parties at the expense of third parties\.
These cases share a common signature: the equilibria reached depend on the joint behavior of the deployed systems, the joint behavior depends on each system’s optimization against the others, and the resulting outcomes diverge systematically from what any single system’s designer anticipated\. This is the algorithmic face of the train\-test\-deploy gap \(Definition[A\.2](https://arxiv.org/html/2606.03237#A1.Thmtheorem2)\)\.
## Appendix CMethod\-Specific Concerns for Dynamic Evaluation
The four ingredients from Section[5\.1](https://arxiv.org/html/2606.03237#S5.SS1)apply across methods of dynamic evaluation, but each method raises its own deployment\-calibration question\. Table[1](https://arxiv.org/html/2606.03237#A3.T1)summarizes the method\-specific concerns for a suite of dynamic evaluation approaches and current examples for each\.
Table 1:Method\-specific concerns for dynamic evaluation\. Each method addresses the four ingredients \(counterparty specification, regress handling, equilibrium targeting, comparability\) through different counterparty constructions, and the ecological validity question takes a different concrete form for each\.
## Appendix DFurther Alternative Views
The main\-paper rebuttals in Section[6](https://arxiv.org/html/2606.03237#S6)address objections specific to our exposition\. Three further objections were addressed implicitly in the development of the paper in Sections[2](https://arxiv.org/html/2606.03237#S2)through[4](https://arxiv.org/html/2606.03237#S4)\. We restate them here and provide summary rebuttals\.
#### Argument 4\. Scale and capability resolve interaction dynamics\.
A sufficiently capable system could model all relevant actors and optimize over the resulting joint dynamics\. Multi\-actor interaction is simply a harder prediction problem rather than a categorically different one\. The transition from $P(s^{\prime} \mid s,a)$ to $P_{\pi}(s^{\prime} \mid s,a)$ expands the state space, and with enough capacity and training data, the system will learn to anticipate best responses and incorporate them into planning\. What looks like a structural limitation is actually a capability gap that continued scaling will close\.
#### Rebuttal\.
Capability improvements have historically resolved problems once thought intractable, with image recognition, protein folding, and game\-playing all succumbing to sufficient scale and architecture\. The question is whether interaction dynamics among adaptive actors belong to this class\. They do not, for reasons that become clear when examining domains where solvers have encountered structural rather than computational limits\.
Financial markets are the most studied such domain\. Decades of increasingly sophisticated modeling have not produced reliable long\-horizon prediction, because markets are reflexive: predictions alter the phenomena predicted, inducing dynamics that invalidate the original forecast\(Soros,[1987](https://arxiv.org/html/2606.03237#bib.bib88)\)\. The efficient market hypothesis encodes this insight at its limit, with exploitable regularities arbitraged away in proportion to their predictability\(Fama,[1970](https://arxiv.org/html/2606.03237#bib.bib89)\)\. Epidemiological forecasting exhibits the same structure: human behavioral responses to forecasts reshape transmission dynamics in ways the models did not anticipate\(Funket al\.,[2010](https://arxiv.org/html/2606.03237#bib.bib90)\)\. In both cases, the obstacle is not a lack of capacity but the reflexive coupling between predictor and predicted, which is the formal signature of the self\-undermining property \(Definition[A\.3](https://arxiv.org/html/2606.03237#A1.Thmtheorem3)\)\.
The objection frames the transition from $P(s^{\prime} \mid s,a)$ to $P_{\pi}(s^{\prime} \mid s,a)$ as merely a state\-space expansion, but this obscures a qualitative shift\. In single\-actor settings, the environment is a fixed function to be learned\. In multi\-actor settings, the environment includes other learners whose adaptations depend on the actor’s own policy\. The relevant analogy is an iterated game against an adversary who observes your strategy and best\-responds, rather than chess where self\-play eventually exhausts the game tree\. Modeling the opponent’s model of your model produces a regress that must be truncated, and the truncation reintroduces the exogeneity assumptions scaling was meant to overcome\(Nachbar,[1997](https://arxiv.org/html/2606.03237#bib.bib91)\)\.
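A minimal sketch of this shift, assuming a rock\-paper\-scissors game as a stand\-in for the multi\-actor setting\. The historical distribution `historical_mix` and all function names are hypothetical and chosen only for illustration\. The first half fits a policy to a fixed opponent distribution and then lets the opponent best\-respond to it\. The second half shows the regress: a policy that stops reasoning at depth k is beaten by one that reasons to depth k\+1\.

```python
# For each move, the move that beats it.
BEATEN_BY = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie, from `mine`'s perspective."""
    if mine == theirs:
        return 0
    return 1 if BEATEN_BY[theirs] == mine else -1


def expected_payoff(move, opponent_mix):
    """Expected payoff of `move` against a mixed strategy {move: probability}."""
    return sum(prob * payoff(move, other) for other, prob in opponent_mix.items())


def best_response(opponent_mix):
    """Pure move with the highest expected payoff against `opponent_mix`."""
    return max(BEATEN_BY, key=lambda move: expected_payoff(move, opponent_mix))


# Stationary view: optimize against a fixed (hypothetical) historical record.
historical_mix = {"rock": 0.6, "paper": 0.2, "scissors": 0.2}
trained_move = best_response(historical_mix)
train_value = expected_payoff(trained_move, historical_mix)

# Deployment: the opponent observes the trained policy and best-responds.
adapted_move = best_response({trained_move: 1.0})
deploy_value = payoff(trained_move, adapted_move)


def level_k_move(k):
    """Level 0 plays trained_move; level k best-responds to level k-1."""
    move = trained_move
    for _ in range(k):
        move = best_response({move: 1.0})
    return move


print(trained_move, round(train_value, 2), adapted_move, deploy_value)
print([level_k_move(k) for k in range(4)])
```

In this toy game the policy that looks best on historical data \(expected payoff of about 0\.4\) loses every round once the opponent adapts\. The level\-k sequence cycles without settling, so any fixed truncation depth amounts to treating the opponent as stationary at that depth\.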
The empirical record in multi\-actor machine learning reinforces this\. Large language models trained on human text exhibit impressive single\-turn capabilities yet defect and retaliate unforgivingly in repeated social dilemmas and fail to coordinate in games that require it\(Akataet al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib92)\)\. Reinforcement learning agents continue to find unexpected exploits in multi\-actor environments as they scale\(Bakeret al\.,[2020](https://arxiv.org/html/2606.03237#bib.bib93)\)\. If scale sufficed, these failures should diminish with capability; instead, more capable actors exploit regularities more aggressively, which is the self\-undermining property in operation\.
#### Argument 5\. Cooperation can be trained as a capability\.
RLHF, Constitutional AI, and cooperative training objectives demonstrate that we can instill cooperative dispositions through training\. Models can learn to be helpful, harmless, and honest\. They can further learn to defer, to ask clarifying questions, to respect boundaries\. Cooperation is a behavioral pattern that emerges from appropriate training signals, not a structural property requiring new architectures\. The alignment research program is already producing cooperative systems\.
#### Rebuttal\.
A broader version of this objection points to cooperative AI, multi\-agent RL, mechanism design for learned agents, and multi\-actor evaluation environments as evidence that the field already treats strategic interdependence as a first\-class concern\. We draw on that body of work throughout the paper\. The question is where it sits in the pipeline that produces frontier deployed systems, and the answer, at present, is on the periphery\. Pretraining runs on static corpora that treat language as exogenous to the model being trained, and recent work shows this homogenizes outputs and, over time, the human language the outputs shape\(Jianget al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib5); Yakuraet al\.,[2025](https://arxiv.org/html/2606.03237#bib.bib3)\)\. Post\-training via RLHF optimizes against a frozen reward model, vulnerable to the Goodhart dynamics the alignment\-faking literature now documents\(Greenblattet al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib16); Sheshadriet al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib25)\)\. Evaluation remains dominated by static leaderboards that do not respond to the systems being scored\(Alzahraniet al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib156)\)\. Where multi\-actor considerations do enter foundation model work, they typically take the form of inference\-time coordination between deployed agents whose training distributions remain fixed\. Inference\-time coordination is a valuable contribution, and it leaves untouched the central question our position raises: what distribution should a system be trained against, given that the distribution will respond?
The narrower version of the argument, that RLHF and Constitutional AI suffice, conflates cooperative behavior during training with cooperation under deployment\. RLHF and Constitutional AI train models to exhibit helpful, harmless, and honest behavior against a fixed distribution of human feedback\(Baiet al\.,[2022](https://arxiv.org/html/2606.03237#bib.bib94)\), but deployment changes the game\. Preferences are endogenous, with users adapting what they want and how they engage as the system shapes their interactions\(Bowles,[1998](https://arxiv.org/html/2606.03237#bib.bib57)\)\. Other actors observe the trained policy and best\-respond to it, exploiting whatever regularities the static training distribution failed to anticipate\. The training signal itself becomes a target, with sufficiently capable models learning to present as cooperative during evaluation while pursuing divergent objectives when conditions change\(Greenblattet al\.,[2024](https://arxiv.org/html/2606.03237#bib.bib16); Sheshadriet al\.,[2026](https://arxiv.org/html/2606.03237#bib.bib25)\)\. These are instances of the self\-undermining property \(Definition[A\.3](https://arxiv.org/html/2606.03237#A1.Thmtheorem3)\) operating inside the training pipeline: optimization against a proxy reliably produces systems that satisfy the proxy while evading its intent\.
#### Argument 6\. Existing institutions and compliance suffice\.
Human societies already have institutions for managing coordination: law, regulation, professional norms, market mechanisms\. AI systems do not need to solve cooperation de novo; they need to comply with existing rules\. The appropriate response to interaction dynamics is governance rather than a fundamental rethinking of AI methodology\. We do not ask other technologies to internalize all coordination problems; we constrain them externally\.
#### Rebuttal\.
The analogy to other technologies obscures a categorical difference\. When the system being governed can represent the governance structure and search for gaps faster than regulators can close them, the relationship between technology and institution changes qualitatively\. Regulatory arbitrage in financial AI is the predictable result of optimizing against codified rules\(Partnoy,[1997](https://arxiv.org/html/2606.03237#bib.bib100)\), and the same dynamic appears whenever a sufficiently capable system models the rules it operates under\. This is the institutional channel of endogenous non\-stationarity \(Definition[A\.1](https://arxiv.org/html/2606.03237#A1.Thmtheorem1)\): the rules an institution sets become a target the system optimizes against, eroding their function\.
The claim that we govern other technologies purely through external constraint is also historically inaccurate\. Automobile safety required decades of design\-level intervention, including seatbelts, crumple zones, and airbags, because post\-hoc liability proved insufficient to prevent harms that engineering could anticipate\(Nader,[1965](https://arxiv.org/html/2606.03237#bib.bib99)\)\. Pharmaceutical regulation mandates clinical trials precisely because market release followed by litigation was inadequate for compounds whose effects unfold over years\. The pattern is consistent: technologies with significant externalities eventually require design\-level accountability rather than only deployment\-level compliance\. AI systems operating strategically among adaptive actors present externalities at least as significant\.
What design\-level accountability looks like for AI is the question institutional design has been studying for centuries in the human case\. Cooperation at scale is sustained by enforceable rules, reputation systems, repeated interactions with identifiable partners, mechanisms for sanctioning defection, and procedures for revising the rules as participants adapt\(Ostrom,[1990](https://arxiv.org/html/2606.03237#bib.bib62); North,[1990](https://arxiv.org/html/2606.03237#bib.bib97)\)\. Many institutions fail; many produce unintended consequences; many require centuries to stabilize\. But their existence demonstrates that strategic interdependence is a problem humans have learned to engineer around\. Computational mechanism design\(Parkes and Wellman,[2015](https://arxiv.org/html/2606.03237#bib.bib41)\), reputation systems for agents, and adaptive coordination protocols are the AI analogues of these institutional technologies\. The question for the field is one of investment: how much of the research effort currently directed at making single AIs more capable should be redirected toward designing the institutional layer in which those AIs operate, before we repeat the cycle of preventable harm followed by reluctant design mandates?