
# Structured Opponent Modeling for LLM-based Agents via Structural Causal Model
Source: [https://arxiv.org/html/2605.07301](https://arxiv.org/html/2605.07301)
Proc. of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), May 25–29, 2026, Paphos, Cyprus. C. Amato, L. Dennis, V. Mascardi, J. Thangarajah (eds.). Affiliations: School of Artificial Intelligence, University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences; National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Pei Xu and Kaiqi Huang are corresponding authors.

###### Abstract

Accurately predicting opponents' behavior from interactions is a fundamental capability for large language model (LLM)-based agents in multi-agent and game-theoretic environments. Existing approaches often entangle opponent modeling with prediction, relying on implicit contextual reasoning and limiting adaptability in dynamic interactions. To this end, we propose Structured Opponent Modeling (SOM), a two-stage opponent modeling framework that explicitly decouples opponent model construction from opponent prediction. At the construction stage, SOM employs a Structural Causal Model (SCM), a graph-based formalism for representing dependencies among variables, to capture directed links between opponents' observations and actions, yielding an explicit and structured opponent representation. At the prediction stage, the LLM performs structured reasoning along clear pathways derived from the SCM, improving both prediction accuracy and stability. Extensive experiments on diverse multi-agent benchmarks demonstrate that SOM consistently outperforms state-of-the-art LLM-based reasoning baselines, enabling more accurate and adaptable strategic decision-making in complex and dynamic multi-agent interactions.

###### Key words and phrases:

Opponent Modeling; Large Language Models; Multi-agent Games

## 1. Introduction

Large Language Models (LLMs) have emerged as a transformative development in artificial intelligence. By training on vast amounts of text data, they acquire extensive world knowledge [sun2023head] and exhibit strong reasoning [imani2023mathprompter] and problem-solving abilities [rasal2024llm]. These powerful capabilities have positioned LLMs as promising candidates for autonomous agents in complex, interactive environments such as economic simulations [horton2023large; li2024econagent], collaborative tasks [chen2024comm], and strategic negotiations [bianchi2024well]. In these multi-agent settings, an agent's success critically hinges on its ability to model opponent behavior and adapt its own strategy accordingly [nashed2022survey]; a lack of deep awareness of the opponent's behavior can lead to strategies that are easily exploited or misaligned, resulting in suboptimal outcomes [carroll2019utility]. This is particularly crucial in strategic reasoning scenarios characterized by complex strategic interactions and continuously evolving behaviors.

However, current approaches tend to implicitly entangle modeling (the process of identifying how opponents make decisions) with opponent prediction through LLM-based contextual reasoning [zhang2024proagent; xu2023exploring; guan2024richelieu; guo2023suspicion]. This approach lacks a clear, controllable reasoning path: it neither specifies how to systematically establish the link between raw observations and an opponent's final action, nor does it guide the language model on which key intermediate reasoning processes to include, such as inferring the opponent's beliefs or hidden information. Without this structural guidance, the language model's inference process becomes difficult to control, often missing key information [liu2023lost] or producing hallucinations [ji2023survey]. While existing structured reasoning methods such as Tree-of-Thought [yao2023tree] and Graph-of-Thought [Besta2024GraphofThoughtsSE] enhance LLM reasoning in many tasks, they are primarily designed for static problem settings and lack mechanisms to incorporate external feedback, making them difficult to adapt to the non-stationary nature of strategic interactions [zhang2024llm]. These limitations highlight the need for new approaches that enable explicit and adaptable opponent modeling in dynamic multi-agent settings.

![Refer to caption](https://arxiv.org/html/2605.07301v1/x1.png)

Figure 1. Illustrating different opponent modeling paradigms. Unlike baselines that ignore opponent behavior or entangle modeling within implicit reasoning, SOM explicitly constructs a structured model to guide opponent prediction.

To address these challenges, we propose Structured Opponent Modeling (SOM), a two-stage framework that explicitly separates opponent model construction and opponent prediction. This design enables LLM-based agents to reason about opponents through a structured and controllable process rather than relying solely on implicit contextual inference. As illustrated in Figure [1](https://arxiv.org/html/2605.07301#S1.F1), this two-stage design offers explicit and controllable reasoning pathways, in contrast to existing LLM-based approaches that entangle opponent modeling within contextual reasoning.

In the opponent model construction stage, SOM builds an explicit opponent model grounded in Structural Causal Models (SCMs), which provide a structural framework to organize reasoning dependencies among observable factors and the opponent's decisions. After each opponent action, the LLM performs reflection to infer how the observed outcome may have arisen, linking the opponent's decisions to contextual cues and hypothesizing intermediate reasoning variables that could explain this connection. These insights are then used to progressively build and refine the SCM, forming the explicit reasoning backbone.

In the opponent prediction stage, the LLM performs reasoning guided by the structured dependencies captured during the construction stage to anticipate the opponent's next action. At each step, the model draws on reasoning examples associated with the relevant dependency in the structure, which record prior successful inferences linking observed factors to opponent behavior. This allows the agent to continuously refine its reasoning with new observations, improving both the accuracy and adaptability of predictions in dynamic multi-agent interactions.

Finally, we validate the effectiveness of our approach across multiple multi-agent game environments. Extensive experiments demonstrate that our framework significantly outperforms existing baseline methods when facing different opponents. Analysis of the training process further confirms that our method accurately learns opponent strategies during interactions.

Overall, our contributions to strategic reasoning can be summarized as follows:

- We propose SOM, a novel opponent modeling framework that leverages Structural Causal Models (SCMs) to transform opponent prediction into a structured and controllable reasoning process.
- Within SOM, we implement two key mechanisms: dynamic construction of the reasoning structure during interactions, and integration of opponent-specific reasoning knowledge into the structured dependencies.
- We empirically validate SOM across diverse multi-agent environments, showing that it outperforms strong baselines and adaptively captures the behavior of different opponents over time.

## 2. Related Work

### 2.1. Strategic Reasoning with LLMs

Strategic reasoning [zhang2024llm] refers to the capability of an agent to analyze the opponent's history and the game state, infer the opponent's strategy and actions, and adjust its own strategy to select the best course of action based on these predictions. Early work like Cicero [meta2022human] combined language models with strategic reasoning, creating a conversational agent capable of playing Diplomacy. Cicero utilized an LLM to model other players' beliefs and intentions to predict their actions, enabling human-level play. Subsequent research has applied LLMs to various multi-player games. In social deduction games like Werewolf [xu2023language; wu2024enhance], studies aim to enhance agents' strategic abilities by enabling them to understand game mechanics and adapt to opponents' tactics, often involving implicit opponent prediction through dialogue analysis. Theory of Mind (ToM) [guo2023suspicion] and k-level thinking models [zhang2024k] have also been adapted to recursively infer opponents' hidden beliefs and predict their behavior in strategic reasoning. The EMO method [yu2025llm] simulates opponent modeling by constructing multiple agent-specific models, but it still lacks an explicit representation of the opponent's decision-making process.

While these methods leverage the powerful reasoning capabilities of LLMs and often incorporate some form of opponent action prediction, they typically treat opponent modeling as a general reasoning task. Although some approaches may use perspective-taking to simulate inferential processes, these often lack clear and controllable reasoning pathways.

### 2.2. Structured Prompting for Reasoning

Structured prompting, a technique that guides LLMs through multi-step reasoning by explicitly structuring the prompt format, has significantly enhanced their reasoning capabilities. A foundational approach is Chain-of-Thought (CoT) [wei2022chain], which enables LLMs to generate a series of intermediate natural language reasoning steps. Building upon this, Self-Consistency (SC) [Wang2022SelfConsistencyIC] improves CoT's robustness by sampling diverse reasoning paths and aggregating results via majority voting. To overcome the inherent linearity of CoT, Tree-of-Thought (ToT) [yao2023tree] models reasoning as a tree-like exploration, allowing for branching and backtracking. Further generalizing this concept, Graph-of-Thought (GoT) [Besta2024GraphofThoughtsSE] employs arbitrary graph structures to represent complex dependencies between thoughts. Building on this, Diagram-of-Thought (DoT) [Zhang2024OnTD] allows a single LLM to internally construct and reason over DAGs using role-specific tokens, streamlining multi-step reasoning without external control. Logic-of-Thought (LoT) [Li2025LogicofThoughtEL] further integrates formal logic into prompts to improve consistency and deductive precision.

While structured prompting has significantly enhanced the reasoning capabilities of LLMs, existing methods are predominantly designed for static problem settings and lack mechanisms to incorporate feedback or adapt their reasoning structures over time. As a result, they struggle to effectively capture opponent behavior in dynamic multi-agent environments characterized by strategic interactions and evolving behaviors. This limitation highlights the urgent need for approaches that enable more adaptive and opponent-aware reasoning in such settings.

![Refer to caption](https://arxiv.org/html/2605.07301v1/x2.png)

Figure 2. Illustration of the opponent modeling pipeline of SOM. SOM operates in two explicit stages. First, it constructs the SCM representation of the opponent by building a structured causal graph that captures key decision-relevant variables and their dependencies. Second, it populates the structural relationships of this SCM with personalized reasoning examples derived from past interactions. During inference, SOM traverses the graph to simulate the opponent's reasoning process step by step, enabling explicit and adaptive opponent modeling.

### 2.3. Opponent Modeling

Opponent modeling (OM), which analyzes and predicts other agents' behaviors in multi-agent systems, is a fundamental technique. To tackle unknown and non-stationary opponents: encoder-decoder architectures [papoudakis2021agent] identify opponent models using only the controlled agent's local information; UAOM [yang2025uncertainty] captures aleatoric and epistemic uncertainties in stochastic opponent behaviors; meta-learned Bayesian belief inference [zintgraf2021deep] combines variational autoencoders to model opponent beliefs; the meta-multiagent policy gradient theorem [kim2021policy] adapts to new agents by accounting for mutual non-stationary dynamics; and GSCU [fu2022greedy] learns offline opponent policy embeddings and trains a universal best-response model. For diverse opponents, MBOM [yu2022model] simulates recursive reasoning via an environment model, adapting to various types by mixing improved policies, while OEOM [jing2025open] continuously generates diverse opponents via population-based training and enhances robustness with in-context reinforcement learning. To exploit opponents: L2E [wu2022l2e] gains exploitation abilities through minimal interactions; M-FOS [lu2022model] achieves long-horizon shaping via model-free optimization; and MOL [hu2023modeling] uses best-response theory to approximate preferences for stable equilibrium improvements. Unlike these traditional opponent modeling approaches, our work focuses on opponent modeling in LLM-driven decision-making scenarios, which has not been adequately explored in existing research.

## 3. Preliminaries

### 3.1. Partially Observable Stochastic Game

We model the multi-agent interactions as a Partially Observable Stochastic Game (POSG), a standard framework for sequential decision-making with multiple agents. A POSG is formally defined by the tuple [yang2020overview]:

$$\langle N, S, \{A^i\}_{i=1}^{N}, P, \{R^i\}_{i=1}^{N}, \gamma, \{O^i\}_{i=1}^{N}, Q \rangle,$$

where $N$ is the set of agents and $S$ denotes the state space. Each agent $i$ has an individual action space $A^i$, and the joint action space is defined as $A = \times_{i=1}^{N} A^i$. The state transition function is given by $P: S \times A \rightarrow \Delta(S)$, where $P(s' \mid s, a)$ denotes the probability of transitioning from state $s$ to state $s'$ after taking joint action $a$. Each agent $i$ receives a scalar reward determined by its reward function $R^i: S \times A \times S \rightarrow \mathbb{R}$, which gives a scalar reward for the transition $(s, a) \rightarrow s'$. $\gamma \in [0, 1]$ is the discount factor.

Each agent $i$ receives observations $o^i \in O^i$ from the environment, and the joint observation space is defined as $O = \times_{i=1}^{N} O^i$. The observation function $Q: S \times A \times S \rightarrow \Delta(O)$ specifies the probability of receiving a joint observation $o$ given joint action $a$ and next state $s'$, i.e., $Q(o \mid a, s')$.

The agent's local history at time $t$ is the sequence of its past observations, actions, and rewards: $h_t^i = (o_0^i, a_0^i, r_0^i, \dots, a_{t-1}^i, r_{t-1}^i, o_t^i)$. The agent's policy maps this history to a distribution over actions: $\pi^i(a_t^i \mid h_t^i)$.

In this work, we focus on the perspective of the self-agent (the agent under our control), denoted by superscript $i$. All other agents, collectively denoted by $-i$, are treated as opponents. Each opponent's policy $\pi^{-i}$ is sampled from a predefined and diverse policy set $\Pi^{\text{opp}}$, which includes fixed, rule-based, and adaptive strategies.

During adaptation, the self-agent $i$ interacts repeatedly with opponents over $M$ episodes of the POSG. The objective is to derive a policy $\pi^i$ that maximizes the expected cumulative reward over the time horizon $T$ and across all $M$ episodes:

$$\max_{\pi^i} \mathbb{E}_{\pi^{-i} \sim \Pi^{\text{opp}}} \left[ \sum_{m=1}^{M} \sum_{t=0}^{T} R_t^i \right].$$
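To make the formalism concrete, the following minimal Python sketch represents agent $i$'s local history $h_t^i$ and the cumulative-reward objective. The class and helper names are illustrative, not part of the paper.

```python
from dataclasses import dataclass, field

@dataclass
class LocalHistory:
    """Agent i's local history h_t^i = (o_0, a_0, r_0, ..., a_{t-1}, r_{t-1}, o_t)."""
    observations: list = field(default_factory=list)  # o_0^i, ..., o_t^i
    actions: list = field(default_factory=list)       # a_0^i, ..., a_{t-1}^i
    rewards: list = field(default_factory=list)       # r_0^i, ..., r_{t-1}^i

    def append_step(self, action, reward, next_obs):
        """Record one transition: act, receive a reward, then observe."""
        self.actions.append(action)
        self.rewards.append(reward)
        self.observations.append(next_obs)

def cumulative_return(episode_histories: list[LocalHistory]) -> float:
    """Proxy for the objective: sum of R_t^i over T steps and all M episodes."""
    return sum(sum(h.rewards) for h in episode_histories)
```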

### 3.2. Structural Causal Models

A Structural Causal Model (SCM) [pearl2000causality] provides a formal framework for representing causal relationships, comprising a set of variables, a causal graph, and structural equations.

##### Causal Graph.

The causal relationships among variables are represented by a causal graph, denoted as $\mathcal{G}(\mathcal{V}, \mathcal{E})$.

- $\mathcal{V}$ is a set of variables (nodes) in the model.
- $\mathcal{E}$ is a set of directed edges, where an edge $V_i \to V_j$ signifies that $V_i$ is a direct causal factor for $V_j$.

The graph $\mathcal{G}$ is a Directed Acyclic Graph (DAG), ensuring no causal cycles.

##### Structural Equations.

For each variable $V_j \in \mathcal{V}$, its value is determined by its direct causal parents, denoted $Pa(V_j)$ (the set of variables in $\mathcal{V}$ with directed edges pointing to $V_j$), along with an exogenous disturbance variable $U_j$ that accounts for external influences not explained by the model. Each such relationship is captured by a structural function $f_j$:

$$V_j = f_j(Pa(V_j), U_j).$$

These structural functions $f_j$ define the mechanism by which the value of each variable is determined by its direct causes.

In our work, we adopt the SCM to formalize the opponent's decision-making process. The variables $V$ encompass not only observable states but also crucial latent variables representing the opponent's internal state (e.g., beliefs). The causal graph $\mathcal{G}$ structures the reasoning flow from observations to beliefs and then to actions, with each step governed by a structural function $f_j$ that represents a decision process. While the graph and functions are unknown, the core premise of our work is that they can be dynamically inferred and approximated by LLMs. Our framework, SOM, is designed to instantiate this SCM, using an LLM to both discover the causal structure and execute the reasoning within it.
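The following toy sketch shows one way such an SCM could be encoded: a parent map plus one callable structural function per non-root node, evaluated in dependency order. The variable names and toy functions are ours for illustration, and the exogenous terms $U_j$ are omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SCM:
    parents: dict[str, list[str]] = field(default_factory=dict)   # V_j -> Pa(V_j)
    functions: dict[str, Callable] = field(default_factory=dict)  # V_j -> f_j

    def evaluate(self, roots: dict) -> dict:
        """Compute every variable from root values via V_j = f_j(Pa(V_j))."""
        values = dict(roots)
        pending = [v for v in self.parents if v not in values]
        while pending:  # naive topological evaluation of the DAG
            for v in list(pending):
                if all(p in values for p in self.parents[v]):
                    values[v] = self.functions[v](*(values[p] for p in self.parents[v]))
                    pending.remove(v)
        return values

# Toy opponent model: observation -> belief -> action.
scm = SCM(
    parents={"belief": ["obs"], "action": ["belief"]},
    functions={
        "belief": lambda obs: "others_bid_high" if obs > 50 else "others_bid_low",
        "action": lambda belief: 60 if belief == "others_bid_high" else 30,
    },
)
print(scm.evaluate({"obs": 72}))  # {'obs': 72, 'belief': 'others_bid_high', 'action': 60}
```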

## 4. Method

To overcome the limitations of unstructured opponent modeling, we propose SOM, a framework that grounds the modeling process in Structural Causal Models and enables structured reasoning from observations to opponent actions.

### 4.1. Overview of SOM Framework

In multi-agent environments, a self-agent's success critically hinges on accurately predicting an opponent's next action based on its own observations and interaction history. However, achieving precise and adaptive opponent modeling in dynamic settings remains a significant challenge.

To address this, we propose SOM, a novel framework that leverages the principles of SCMs for precise and adaptive opponent behavior prediction. SOM's overall architecture is designed to enhance modeling adaptability by uncovering the underlying logic of an opponent's decisions.

As shown in Figure [2](https://arxiv.org/html/2605.07301#S2.F2), the framework consists of two interconnected mechanisms: Dynamic SCM Construction and Refinement, which builds and updates the structured representation of the opponent's decision process through a causal graph; and Reasoning for Opponent Prediction and Adaptation, which performs structured inference within this SCM using personalized reasoning knowledge to predict opponent actions. The following sections detail these two components and their interactions.

### 4.2. Dynamic SCM Construction and Refinement

To systematically establish a structured link between observations and opponent behavior, SOM grounds opponent model construction in the framework of Structural Causal Models (SCMs), where this structured dependency is explicitly represented through a causal graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$. Accordingly, SOM dynamically constructs and continuously refines this graph to capture how observable factors and latent reasoning variables jointly shape the opponent's decisions over time. The process follows an "observation → reflection → extraction → consolidation → pruning" cycle, enabling the model to adaptively update its representation of the opponent's decision logic as interactions unfold.

**Graph Initialization.** SOM begins by constructing a minimal directed acyclic graph $\mathcal{G}_0 = (\mathcal{V}_0, \mathcal{E}_0)$ before interaction. The initial node set $\mathcal{V}_0$ includes all observable variables $\{o^i_{t,1}, \dots, o^i_{t,k}\}$ and the opponent's action $a^{-i}_t$. Edges are added from each observation variable to the opponent action, $\mathcal{E}_0 = \{(o^i_{t,k}, a^{-i}_t) \mid \forall k\}$, representing an initial hypothesis that all observations may directly influence the opponent's action.
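As a sketch, this initialization step might look as follows; networkx is an assumed dependency here, and the observable variable names are invented for illustration.

```python
import networkx as nx  # assumed graph library; any DAG representation works

def init_causal_graph(observation_vars: list[str], action_var: str = "opponent_action"):
    """Build G_0: every observable variable is hypothesized to directly cause the action."""
    g = nx.DiGraph()
    g.add_nodes_from(observation_vars + [action_var])
    g.add_edges_from((o, action_var) for o in observation_vars)
    return g

g0 = init_causal_graph(["round_index", "last_group_average", "own_last_choice"])
```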

**Reflection Phase.** As the interaction proceeds, after observing the opponent's actual action $a^{-i}_t$ in each round, SOM prompts the LLM to generate a natural language reflection. This reflection, based on the interaction history, the agent's current observation $o^i_t$, and the opponent's action $a^{-i}_t$, hypothesizes potential intermediate reasoning steps or latent beliefs that led the opponent from observations to the final action.

**Structured Extraction.** Another LLM module parses this reflective text, transforming unstructured natural language into structured causal chains. This process extracts intermediate nodes $V_{\text{mid}}$ that lie between observations and actions, along with the specific causal pathways they form (e.g., $o^i_{t,k} \to V_{\text{mid}} \to a^{-i}_t$).

**Graph Update and Consolidation.** After extraction, the system executes graph update and consolidation. For each newly extracted intermediate node $V_{\text{new}}$, SOM queries an LLM to determine whether it is semantically equivalent to any existing node in the graph's node set $\mathcal{V}$. To do this, the LLM receives the description of $V_{\text{new}}$ and a list of all existing nodes in $\mathcal{V}$, and then makes a matching decision. Concurrently, the system maintains a reinforcement count $c(V)$ for each intermediate node $V \in \mathcal{V}$: if $V_{\text{new}}$ matches an existing node $V_{\text{exist}}$, its count is incremented ($c(V_{\text{exist}}) \leftarrow c(V_{\text{exist}}) + 1$). If no match is found, $V_{\text{new}}$ is added to $\mathcal{V}$ as a new node with $c(V_{\text{new}}) = 1$, and the edge set $\mathcal{E}$ is updated according to the extracted causal chain.

**Graph Refinement and Pruning.** To control complexity and retain critical causal hypotheses, the framework performs graph refinement and pruning. After each update, SOM ranks all intermediate nodes by their reinforcement counts $c(V)$ and retains only the top-$K$ nodes. Nodes below this rank are pruned from the graph, keeping it concise and effective by preserving repeatedly validated decision logic and discarding transient or refuted hypotheses.
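A compact sketch of the consolidation-and-pruning step described above is given below. Here `llm_is_same_node` stands in for the LLM's semantic-equivalence query, `counts` holds the reinforcement counts $c(V)$ for intermediate nodes, and `top_k` plays the role of $K$; the graph is the networkx DAG from the earlier sketch.

```python
def consolidate_and_prune(graph, counts, new_node, chain, llm_is_same_node, top_k=5):
    """One update cycle: match V_new against existing intermediate nodes,
    reinforce or insert it, then keep only the top-K nodes by count."""
    match = next((v for v in counts if llm_is_same_node(new_node, v)), None)
    if match is not None:
        counts[match] += 1            # c(V_exist) <- c(V_exist) + 1
    else:
        counts[new_node] = 1          # add V_new with c(V_new) = 1
        graph.add_node(new_node)
        graph.add_edges_from(chain)   # e.g. [(obs_k, V_new), (V_new, action)]

    # Prune intermediate nodes ranked below the top-K by reinforcement count.
    for v in sorted(counts, key=counts.get, reverse=True)[top_k:]:
        graph.remove_node(v)          # networkx also drops incident edges
        del counts[v]
```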

### 4.3. Reasoning for Opponent Prediction and Adaptation

Given the constructed SCM that represents the opponent's decision process, SOM performs opponent prediction through structured reasoning along the dependencies encoded in the model. Specifically, reasoning proceeds in three stages: topological inference, example-guided reasoning, and personalized adaptation. By simulating the functional relationships among variables defined in the SCM, SOM predicts the opponent's next action $a^{-i}_{t+1}$ and continually updates its understanding of the opponent as interactions unfold.

SOM traverses the causal graph $\mathcal{G}$ in topological order, ensuring that each node $V_j$ is inferred only after all of its parent nodes $Pa(V_j)$ have been determined. The root nodes, typically the agent's observations $o^i_{t+1}$, obtain their values directly from the environment, whereas intermediate and action nodes are computed through a step-by-step inference process that depends on their parent nodes. For each node $V_j$, its value is determined by a structural equation $V_j = f_j(Pa(V_j))$, which is implemented by an LLM equipped with dynamically updated knowledge to simulate the parent-to-child causal mapping.

To enable accurate node inference, SOM constructs a tailored prompt for the LLM. The prompt consists of two components: (i) the current inferential context, namely the determined values of all parent nodes $Pa(V_j)$, and (ii) relevant reasoning examples retrieved from an opponent-specific example pool $\mathcal{P}_{\text{opponent}}$. The retrieval process first converts the parent nodes and their values into a textual query, and then performs semantic-similarity search in $\mathcal{P}_{\text{opponent}}$ to identify the top-$M$ most similar examples. Combining the context with these retrieved examples, the LLM performs example-guided reasoning to infer the most likely value of $V_j$ and generate the corresponding reasoning text. This personalized process enhances prediction stability.
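A prediction pass over the constructed graph could then be sketched as follows; `llm_infer` and `retrieve` are placeholders for the paper's LLM inference module and the semantic-similarity search over $\mathcal{P}_{\text{opponent}}$, and the node name `opponent_action` is an assumption carried over from the earlier sketches.

```python
import networkx as nx

def predict_opponent_action(graph, root_values, example_pool, llm_infer, top_m=3):
    """Traverse the SCM in topological order, inferring each non-root node
    with an example-guided LLM call."""
    values = dict(root_values)  # observations o_{t+1}^i fill the root nodes
    for node in nx.topological_sort(graph):
        if node in values:
            continue  # root values come straight from the environment
        parents = {p: values[p] for p in graph.predecessors(node)}
        examples = retrieve(example_pool, query=str(parents), k=top_m)
        values[node] = llm_infer(node, parents, examples)
    return values["opponent_action"]

def retrieve(pool, query, k):
    """Stand-in for top-M semantic-similarity retrieval over the example pool."""
    return pool[:k]
```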

**Algorithm 1: SOM Opponent Modeling Loop**

1. **Initialize:** minimal causal graph $\mathcal{G}$; example pool $\mathcal{P}_{\text{opponent}} \leftarrow \emptyset$
2. **For** each interaction round $t = 1$ to $T$ **do**:
   1. Observe current observation $o^i_t$ and the opponent's actual action $a^{-i}_t$
   2. **Construct/Update Graph:** prompt the LLM to hypothesize causal links from $o^i_t$ to $a^{-i}_t$
   3. Extract nodes and edges to update the graph $\mathcal{G}$
   4. **Update Example Pool:** if the prediction for round $t-1$ ($\hat{a}^{-i}_{t-1}$) was correct, add its successful parent-to-child reasoning steps to $\mathcal{P}_{\text{opponent}}$
   5. **Predict Next Action:** traverse $\mathcal{G}$ in topological order
   6. **For** each node $V_j$ in topological order **do**:
      1. Retrieve examples based on parent nodes $Pa(V_j)$
      2. Infer node value $V_j = f_j(Pa(V_j), \text{examples})$
   7. Output predicted action $\hat{a}^{-i}_{t+1} = \text{value of } V_{a^{-i}}$
   8. The agent selects its own action based on $\hat{a}^{-i}_{t+1}$

SOM maintains a shared causal graph $\mathcal{G}$ while achieving personalized adaptation to different opponents through dynamically maintained, opponent-specific example pools. For each opponent, a distinct example pool $\mathcal{P}_{\text{opponent}}$ is incrementally populated with parent-to-child reasoning steps generated by the LLM (only when the predictions are correct). Each example $e$ is formally represented as a four-tuple, $e = \langle \text{parent values}, \text{child value}, \text{reasoning text}, \text{target link} \rangle$, capturing one validated reasoning event. A strict credit-assignment policy ensures the quality of the stored knowledge: only when the predicted action $\hat{a}^{-i}_{t+1}$ matches the observed action $a^{-i}_{t+1}$ are all intermediate reasoning steps accepted, formatted as examples, and stored in the corresponding pool. This mechanism accumulates high-quality, opponent-specific reasoning knowledge, enabling SOM to perform highly personalized and adaptive opponent modeling. The complete process of SOM is detailed in Algorithm 1.
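The four-tuple examples and the strict credit-assignment rule could be represented as in this sketch (type and function names are ours, not the paper's):

```python
from typing import NamedTuple

class ReasoningExample(NamedTuple):
    """e = <parent values, child value, reasoning text, target link> (Sec. 4.3)."""
    parent_values: dict
    child_value: object
    reasoning_text: str
    target_link: tuple  # e.g. ("belief", "opponent_action")

def commit_if_correct(pool, pending_steps, predicted_action, observed_action):
    """Strict credit assignment: store the round's reasoning steps in the
    opponent-specific pool only when the action prediction was correct."""
    if predicted_action == observed_action:
        pool.extend(ReasoningExample(*step) for step in pending_steps)
    pending_steps.clear()
```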

## 5. Experiment

Table 1. Win rates of different reasoning methods against various opponents in the G0.8A game. Rows represent the evaluated agent, and columns represent the opponent type. SOM achieves the highest overall average win rate, particularly excelling against the Mixed opponent group that aggregates diverse reasoning strategies.

| Evaluated Method | LLM only | CoT | ToT | K-R | Reflexion | Ours | Mixed | Avg. |
|---|---|---|---|---|---|---|---|---|
| LLM only | 0.19 | 0.04 | 0.12 | 0.02 | 0.10 | 0.03 | 0.07 | 0.08 |
| CoT [wei2022chain] | 0.68 | 0.16 | 0.54 | 0.28 | 0.32 | 0.09 | 0.36 | 0.35 |
| ToT [yao2023tree] | 0.46 | 0.18 | 0.22 | 0.12 | 0.22 | 0.11 | 0.26 | 0.22 |
| K-R [zhang2024k] | 0.84 | 0.54 | 0.48 | 0.24 | 0.45 | 0.17 | 0.42 | 0.45 |
| Reflexion [shinn2023reflexion] | 0.64 | 0.10 | 0.40 | 0.20 | 0.26 | 0.23 | 0.54 | 0.34 |
| Ours | 0.80±0.14 | 0.61±0.11 | 0.59±0.09 | 0.39±0.12 | 0.47±0.08 | 0.19±0.10 | 0.64±0.13 | 0.53 |

### 5.1. Experiment Setup

**Environments.** We evaluate our approach in three distinct multi-agent game environments:

- G0.8A [zhang2024k]: A multi-round number-guessing game in which players choose a number between 1 and 100 in each round, aiming to be closest to 80% of the group average. This is a variant of the classic "Guess 2/3 of the Average" game proposed by Ledoux [ledoux1981concours], where success hinges on accurately anticipating others' choices (a short worked example follows this list).
- Survival Auction Game (SAG) [mao2024alympics]: A multi-round sealed-bid auction game, adapted from the classic sealed-bid auction game [vickrey1961counterspeculation], where players bid for water to restore health points. In each round, players submit bids privately, and the highest bidder wins the water. Success hinges on accurately anticipating opponents' bids to acquire water at the lowest possible cost.
- Undercover Game [xu2023magic]: A social deduction game where players are Civilians or Undercovers holding different words. Players infer their own roles from clues. Civilians aim to identify Undercovers, who try to conceal their roles. The core challenge is reasoning about others' roles based on their behaviors.
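For concreteness, here is the worked example referenced in the G0.8A entry above: a small function computing the round target and winner (the helper and player names are ours).

```python
def g08a_winner(guesses: dict[str, float]) -> tuple[float, str]:
    """G0.8A round: the target is 0.8 x the group average; the closest guess wins."""
    target = 0.8 * sum(guesses.values()) / len(guesses)
    winner = min(guesses, key=lambda p: abs(guesses[p] - target))
    return target, winner

# Example round: target = 0.8 * (50 + 40 + 30) / 3 = 32, so Carol is closest.
print(g08a_winner({"Alice": 50.0, "Bob": 40.0, "Carol": 30.0}))  # (32.0, 'Carol')
```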

**Enhanced Reasoning Baselines.** Recent advances in prompting techniques have significantly improved the reasoning capabilities of large language models. We focus on four representative baseline methods:

- Chain of Thought (CoT) [wei2022chain] is a prompting method that guides LLMs to generate explicit intermediate reasoning steps, enabling them to decompose complex problems into simpler parts.
- Tree of Thoughts (ToT) [yao2023tree] generalizes CoT by allowing LLMs to explore multiple reasoning paths. It facilitates deliberate decision-making by evaluating multiple reasoning paths, self-evaluating progress, and applying lookahead and backtracking strategies.
- K-Level Reasoning (K-R) [zhang2024k] equips LLMs with recursive strategic reasoning, enabling agents to form higher-order beliefs about others' beliefs and adapt dynamically in multi-agent environments.
- Reflexion [shinn2023reflexion] enables LLM agents to improve through linguistic feedback instead of weight updates, by verbally reflecting on task feedback and storing reflections in episodic memory for better future decisions.

Meanwhile, we introduce an additional baseline named Mixed Opponent (Mixed), constructed by randomly sampling opponent behaviors from CoT, ToT, K-R, and Reflexion agents. This baseline is designed to simulate a more diverse and uncertain opponent environment.

Table 2. Average survival rounds of different reasoning methods against various opponents in the Survival Auction Game (SAG). Rows denote the evaluated reasoning method, while columns denote the opponent type. SOM achieves the longest survival across most opponent types, especially under the Mixed setting, demonstrating its robustness and adaptability in dynamic auction interactions.

| Evaluated Method | LLM only | CoT | ToT | K-R | Reflexion | Ours | Mixed | Avg. |
|---|---|---|---|---|---|---|---|---|
| LLM only | 5.7 | 4.0 | 5.4 | 4.7 | 6.8 | 4.2 | 4.9 | 5.1 |
| CoT [wei2022chain] | 6.5 | 5.0 | 7.8 | 5.6 | 5.6 | 4.9 | 5.5 | 5.8 |
| ToT [yao2023tree] | 6.0 | 5.8 | 3.7 | 5.1 | 6.4 | 5.5 | 4.6 | 5.3 |
| K-R [zhang2024k] | 8.1 | 8.4 | 7.8 | 5.6 | 7.4 | 6.1 | 6.2 | 7.1 |
| Reflexion [shinn2023reflexion] | 3.7 | 3.7 | 7.2 | 6.6 | 4.4 | 5.8 | 5.2 | 5.2 |
| Ours | 9.1±0.73 | 8.8±0.80 | 8.3±0.69 | 7.9±0.81 | 8.1±0.80 | 4.9±0.72 | 7.4±0.83 | 7.8 |

For a fair comparison, all methods are provided with a warm-up phase of 5 episodes prior to evaluation, during which interaction histories are collected. These histories are supplied as contextual input to the baseline methods during evaluation. During the evaluation phase, no additional cross-episode history is provided. Similarly, our method fixes the SCM structure during evaluation and does not perform any cross-episode updates or adaptation, in order to ensure consistency and reproducibility across multiple runs. Unless otherwise specified, all methods and results are evaluated using GPT-4o as the base model.

More detailed experimental settings can be found in the supplementary materials.

### 5.2. Results

##### G0.8A Game and Survival Auction Game.

Tables [1](https://arxiv.org/html/2605.07301#S5.T1) and [2](https://arxiv.org/html/2605.07301#S5.T2) summarize performance across the G0.8A and SAG environments. In G0.8A, SOM achieves the highest average win rate (0.53). Against Mixed opponents, SOM substantially outperforms ToT and CoT, demonstrating its adaptability to heterogeneous strategies. While K-R excels against single LLM-only opponents, its performance declines against diverse groups, showing the limitations of fixed $k$-level assumptions under non-stationary behaviors. Reflexion shows moderate gains overall, and achieves a slightly higher win rate than other baselines when SOM is the opponent; this asymmetry reflects that SOM's stable, equilibrium-oriented reasoning may increase tie frequency in a game where win rates are theoretically low (0.2–0.3). In contrast, ToT and CoT struggle with dynamic mixed strategies, confirming that explicit two-stage modeling provides a tangible advantage.

In SAG, SOM consistently leads all baselines with an average survival of 7.8 rounds. Notably, SOM surpasses K-R in nearly all matchups, highlighting its advantage when optimal bidding requires continuous adjustment. CoT and ToT exhibit variable performance, with ToT struggling against heterogeneous strategies, underscoring the limitations of static tree-based reasoning. While Reflexion improves via episodic feedback, it lacks SOM's stability due to the absence of structured model construction. Across both environments, SOM's two-stage approach, which separates model construction from prediction and integrates opponent-specific knowledge, demonstrates robust adaptability and superior performance against diverse or dynamically changing opponents.

##### Undercover Game.

In the Undercover Game (Figure [4](https://arxiv.org/html/2605.07301#S5.F4)), which requires linguistic reasoning and implicit role inference, SOM again demonstrates consistent superiority over all baselines. It achieves higher win rates both as a Civilian (Figure [4(a)](https://arxiv.org/html/2605.07301#S5.F4.sf1)), when identifying deceptive language patterns, and as an Undercover (Figure [4(b)](https://arxiv.org/html/2605.07301#S5.F4.sf2)), when strategically concealing its role. This performance improvement highlights SOM's ability to integrate structural knowledge about discourse patterns, such as topic shifts and semantic divergence, into its reasoning process. While CoT and ToT often overfit to surface-level linguistic cues, SOM's structured model allows it to capture how utterances functionally depend on hidden role intent.

##### Multi-Round Interaction Analysis.

To investigate SOM's long-term performance, we analyze its prediction deviation and win rate in G0.8A over continuous episodes. Both SOM and Reflexion are initialized with full historical context (SOM through state refinement, Reflexion via retrieved historical reflections) against LLM-only opponents.

![Refer to caption](https://arxiv.org/html/2605.07301v1/x3.png) (a) Prediction deviation.
![Refer to caption](https://arxiv.org/html/2605.07301v1/x4.png) (b) Win rate over episodes.

Figure 3. Action prediction deviation and win rate over episodes in G0.8A. (a) Prediction deviation: SOM maintains higher accuracy and stability than Reflexion. (b) Win rate: SOM exhibits superior learning progress and a higher final win rate over extended episodes.

As shown in Figure [3(a)](https://arxiv.org/html/2605.07301#S5.F3.sf1), SOM's prediction deviation steadily decreases and stabilizes below Reflexion's, indicating that the dynamic SCM refinement effectively captures opponent patterns. This superior accuracy translates into a strategic advantage: Figure [3(b)](https://arxiv.org/html/2605.07301#S5.F3.sf2) shows that SOM's win rate consistently increases in tandem with error reduction, eventually significantly outperforming all baselines. Conversely, Reflexion plateaus due to the absence of a structured, continuous modeling process. These results validate SOM's core design (improving decision quality through adaptive opponent modeling) and demonstrate its robust capacity for progressive reasoning and adaptation.

Table 3. Ablation study of SOM components. We incrementally add SOM's core modules to evaluate their impact on prediction deviation and win rate in the G0.8A game.

| Model Variant | Prediction Deviation ↓ (%) | Win Rate ↑ |
|---|---|---|
| LLM-only | 43.0 | 0.04 |
| + Static Graph | 30.4 | 0.19 |
| + Intermediate Nodes | 27.1 | 0.51 |
| + Graph Refine | 26.9 | 0.54 |
| + Reasoning Examples (SOM) | 25.3 | 0.61 |

### 5.3. Analysis of SOM's Components

#### 5.3.1. Ablation Study

To validate the effectiveness of each core component of SOM, we conduct a series of ablation experiments by incrementally adding key modules and evaluating their impact on model performance. The experiments are carried out in the G0.8A game environment, and the results are summarized in Table [3](https://arxiv.org/html/2605.07301#S5.T3).

**LLM-only:** As the baseline setting, this variant involves no structured modeling. It yields the highest prediction deviation (43.0%) and the lowest win rate, indicating the limitations of relying solely on end-to-end language model reasoning without structural guidance.

![Refer to caption](https://arxiv.org/html/2605.07301v1/x5.png) (a) As Civilian in the game
![Refer to caption](https://arxiv.org/html/2605.07301v1/x6.png) (b) As Undercover in the game

Figure 4. Win rate against different opponents in the Undercover game. The performance of SOM and baseline methods is evaluated against LLM-only opponents and Mixed opponents. SOM consistently outperforms all baselines in both scenarios.

**+ Static Graph:** When a static causal graph is introduced, consisting only of direct edges from observation nodes to the action node, the prediction deviation drops significantly and the win rate improves. This demonstrates that even a basic hypothesized reasoning structure can provide meaningful guidance for the LLM's inference process, improving both stability and directionality.

**+ Intermediate Nodes:** We then incorporate the mechanism for dynamically extracting intermediate variables from LLM-generated reflections. This leads to a substantial boost in both prediction accuracy and win rate, highlighting that the key to effective modeling lies not merely in surface-level observation-action mappings, but in uncovering intermediate reasoning steps that reflect the opponent's underlying decision process. These steps guide the LLM to reason in a structured, step-by-step manner.

**+ Graph Refine:** The addition of the graph refinement and pruning mechanism, based on reinforcement counts, helps retain reliable causal paths and eliminate spurious or hallucinated connections. This further stabilizes performance by reducing redundancy and preserving the most consistent decision logic.

**+ Reasoning Examples (SOM):** Finally, we enable the full SOM framework by adding the personalized example-pool mechanism. This module retrieves and leverages previously successful reasoning trajectories that are semantically similar to the current context, effectively simulating the structural equations $V_j = f_j(Pa(V_j))$ defined in the SCM framework. This step validates the central advantage of our two-stage design: while the causal graph captures the general structure of an opponent's decision logic, the reasoning examples instantiate the functional mappings within that structure, enabling highly personalized and adaptive opponent modeling. This results in the lowest prediction deviation (25.3%) and the highest win rate (0.61) among all variants.

Table 4. Knowledge transfer test. Each agent plays against the same strong opponent: CoT (GPT-4o). SOM-T denotes a transferred SOM model originally constructed by a GPT-4o agent during its interaction with the CoT (GPT-4o) opponent.

| Agent Variant | Prediction Deviation ↓ (%) | Win Rate ↑ |
|---|---|---|
| GPT-4o + SOM | 25.3 | 0.61 |
| LLaMA-3-8B + CoT | 76.1 | 0.07 |
| LLaMA-3-8B + SOM-T | 45.8 | 0.31 |
| Mixtral-8B + CoT | 95.6 | 0.02 |
| Mixtral-8B + SOM-T | 65.2 | 0.27 |

Overall, the ablation results clearly demonstrate that each component of SOM contributes significantly to its final performance. In particular, the introduction of intermediate variables and the use of personalized reasoning examples are critical to improving both predictive accuracy and strategic decision-making.

#### 5.3.2. Structured Knowledge Transfer Analysis

One core advantage of SOM is its ability to construct structured opponent models that generalize across different LLMs. To validate this, we conduct a knowledge transfer experiment (Table [4](https://arxiv.org/html/2605.07301#S5.T4)), testing whether the SOM-generated model, comprising a causal graph $\mathcal{G}$ and an opponent-specific example pool $\mathcal{P}_{\text{opponent}}$, can be effectively reused by other agents.

We begin by allowing a strong agent (GPT-4o + SOM) to interact with a strong opponent (CoT-driven GPT-4o) and store the final constructed opponent model, including the causal graph and the reasoning examples. Then, we test two weaker open-source models (LLaMA-3-8B and Mixtral-8B) by directly loading this constructed model (denoted as SOM-T) and using it to play against the same CoT (GPT-4o) opponent.

As shown in Table [4](https://arxiv.org/html/2605.07301#S5.T4), SOM-T substantially improves the performance of weaker models without any additional training. When LLaMA-3-8B uses the transferred SOM model, its prediction deviation falls from 76.1% to 45.8% and its win rate rises from 0.07 to 0.31. Mixtral-8B shows the same pattern. Importantly, the improvement is achieved by directly loading the SOM-based opponent model into the target agent, without fine-tuning the target LLM. This demonstrates that the structured representation learned by SOM encodes reusable behavioral regularities that benefit different model architectures. At the same time, transferred models do not fully close the gap to the high-capacity SOM instantiation, indicating that recipient model capacity and inference ability still constrain final performance. In short, SOM produces structured opponent knowledge that is transferable and practically useful across LLMs, while remaining complementary to improvements in base model capacity.

## 6. Conclusion

We introduce SOM, a novel two-stage opponent modeling framework inspired by structural causal modeling principles. By dynamically constructing and refining a structured reasoning graph, SOM explicitly decouples the process of opponent model construction from that of behavior prediction. Comprehensive experiments demonstrate that this structured reasoning approach substantially improves prediction accuracy and decision-making performance across diverse game-theoretic environments. Furthermore, the modular opponent models produced by SOM can be seamlessly transferred to empower other agents, highlighting its generality and practical utility.

Nevertheless, we acknowledge a limitation: the "causal" structures discovered by SOM represent functional dependencies inferred from observational data, rather than verified causal mechanisms of the opponent's cognition. Bridging this gap requires integrating more principled causal discovery techniques or controlled interaction settings in future work.

Overall, SOM offers a promising step toward building LLM-based agents that are more adaptive, interpretable, and robust in complex multi-agent environments.

## Acknowledgments

This work was supported by the National Science and Technology Major Project under Grant No. 2022ZD0116403, and in part by the Beijing Natural Science Foundation under Grant No. 4264131.

