OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

arXiv cs.AI Papers

Summary

This paper presents OSCToM, an RL-guided method for generating adversarial data to test nested belief conflicts in LLMs, improving Theory of Mind reasoning on benchmarks like FANToM.

arXiv:2605.20423v1 Announce Type: new Abstract: Large Language Models (LLMs) perform well on many language tasks, but their Theory of Mind (ToM) reasoning is still uneven in complex social settings. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult. This paper presents OSCToM (Observer-Self Conflict Theory of Mind), an approach for modeling nested belief conflicts in LLM-based ToM tasks. The key case is one in which an observer's view of another agent conflicts with the observer's own belief state. Such cases go beyond simple perspective-taking and require recursive, multi-layered reasoning. OSCToM combines reinforcement learning (RL), an extended domain-specific language, and compositional surrogate models to generate observer-self conflicts. In our experiments, OSCToM-8B gives the best overall result among the systems tested. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi-ToM and BigToM. On the information-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0.2% reported by ExploreToM. The data-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning. The project code is available at https://github.com/sharminsrishty/osct.

Cached at: 05/22/26, 08:46 AM

# OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind
Source: [https://arxiv.org/html/2605.20423](https://arxiv.org/html/2605.20423)
- Sharmin Sultana Srishty, Department of Computer Science, BRAC University, sharmin\.sultana\.srishty@g\.bracu\.ac\.bd
- Kazi Mahathir Rahman, Department of Computer Science, BRAC University, kazi\.mahathir\.rahman@g\.bracu\.ac\.bd
- Malaika Parizat Sakkhi, Department of Computer Science, BRAC University, malaika\.parizat\.sakkhi@g\.bracu\.ac\.bd
- Samia Shahid Prianna, Department of Computer Science, BRAC University, samia\.shahid\.prianna@g\.bracu\.ac\.bd
- Shaikhul Islam Sinat, Department of Computer Science, BRAC University, shaikhul\.islam\.sinat@g\.bracu\.ac\.bd

###### Abstract

Large Language Models \(LLMs\) perform well on many language tasks, but their Theory of Mind \(ToM\) reasoning is still uneven in complex social settings\. Existing benchmarks, including ExploreToM, do not always test the recursive beliefs and information asymmetries that make these settings difficult\. This paper presents OSCToM \(Observer\-Self Conflict Theory of Mind\), an approach for modeling nested belief conflicts in LLM\-based ToM tasks\. The key case is one in which an observer’s view of another agent conflicts with the observer’s own belief state\. Such cases go beyond simple perspective\-taking and require recursive, multi\-layered reasoning\. OSCToM combines reinforcement learning \(RL\), an extended domain\-specific language, and compositional surrogate models to generate observer\-self conflicts\. In our experiments, OSCToM\-8B gives the best overall result among the systems tested\. It improves on the reported ExploreToM results on FANToM and remains competitive on Hi\-ToM and BigToM\. On the information\-asymmetric FANToM benchmark, OSCToM reaches 76% accuracy, compared with the 0\.2% reported by ExploreToM\. The data\-synthesis procedure is also 6x more efficient, indicating that targeted training data can help smaller models handle advanced cognitive reasoning\. The project code is available at [https://github\.com/sharminsrishty/osct](https://github.com/sharminsrishty/osct)\.

Keywords: Theory of Mind \(ToM\), Large Language Models \(LLM\), Observer\-Self Conflict, Nested Beliefs, Reinforcement Learning \(RL\), Adversarial Benchmarks, Social Intelligence, Recursive Reasoning\.

## 1 Introduction

Theory of Mind \(ToM\) is the ability to reason about the beliefs, intentions, and knowledge of other agents\. The term was introduced in studies of primate social intelligence\[[24](https://arxiv.org/html/2605.20423#bib.bib8)\]and later became central to work on human social cognition\. ToM supports many forms of social interaction, including cooperation, persuasion, and deception\. In everyday reasoning, people do not only track facts; they also track what others know, what others falsely believe, and how those beliefs differ from reality\. For Large Language Models \(LLMs\), this ability is an important part of social reasoning\. It also shifts evaluation beyond fluent text generation toward the question of whether a model can represent and update mental states\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/osctom_architecture.png) Figure 1: The OSCToM system architecture\. The left module depicts the Adversarial Story Generator \(combining RL and surrogate evaluation\), which constructs the OSCT dataset\. The right module depicts the subsequent LLM Fine\-Tuning pipeline, resulting in the final OSCToM\-8B model\.

Early NLP work tested ToM mostly with static, hand\-written vignettes\. A common example is ToMi \[[19](https://arxiv.org/html/2605.20423#bib.bib7)\], which evaluates simple false\-belief tracking\. These tasks were useful, but they are now less reliable as model scale and training data have increased\. A model may score well by relying on familiar story patterns or surface cues rather than by tracking beliefs in a consistent way \[[18](https://arxiv.org/html/2605.20423#bib.bib5),[26](https://arxiv.org/html/2605.20423#bib.bib6),[6](https://arxiv.org/html/2605.20423#bib.bib13)\]\. This creates a reasoning gap: models can perform well on standard examples but fail when the same logic is presented through a slightly different narrative structure\.

Programmatic and adversarial generation offers one way to reduce these limits\. The ExploreToM framework\[[28](https://arxiv.org/html/2605.20423#bib.bib1)\], for example, uses a domain\-specific language \(DSL\) and A\* search to synthesize stories under difficult information conditions\. This is an important step, but search\-based generation still has limitations\. A\* search is constrained by the predefined search space and does not adapt its policy from experience\. It may also reward informational volume more than the specific structure of the belief conflict\. As a result, models can still fail on narratives that require deep information asymmetry and multi\-layered recursive tracking, such as cases where one agent reasons about another agent’s belief about a third agent’s intention\.

We focus on a type of reasoning that we call Observer\-Self Conflict\. This state appears when an observer attributes a belief to another agent while holding a different belief internally\. The conflict is not only between a character and the true world state, but also between nested perspectives inside the observer’s model of the situation\. Such cases are common in complex social reasoning, but they are difficult to generate and verify at scale\. OSCToM addresses this problem by adding these conflict structures to an optimized LLM generation and training pipeline\.

Instead of relying only on fixed search heuristics, OSCToM treats story generation as an optimization problem\. A DQN\-based reinforcement learning generator learns to move through an extended DSL that supports 4th\-order belief tracking and deceptive mental states\. Because direct LLM verification is expensive for this type of generation, we use a compositional surrogate pipeline\. The pipeline contains six specialized modules that estimate factual and belief difficulty at much lower cost than full LLM\-based verification\. This surrogate\-guided design makes large\-scale adversarial data synthesis practical and gives a 6x overall efficiency gain in pipeline execution\.

The resulting model, OSCToM\-8B, is trained with a two\-stage curriculum\. In our experiments, it shows strong 4th\-order reasoning performance and outperforms much larger models in several settings\. On the information\-asymmetric FANToM benchmark \[[36](https://arxiv.org/html/2605.20423#bib.bib2)\], OSCToM\-8B reaches 76% accuracy, compared with the 0\.2% reported for the ExploreToM baseline\.

The rest of this paper is organized as follows: Section 2 reviews related work in Theory of Mind and adversarial benchmarks; Section 3 details the OSCToM framework, the Extended DSL, and our RL\-guided generation policy; Section 4 presents our experimental setup and results across global benchmarks; and Section 5 discusses the implications of our findings for the future of social AI\.

## 2 Related Work

Theory of Mind \(ToM\) is usually defined as the ability to attribute mental states, such as beliefs, intentions, and knowledge, to oneself and to others\. It was first formalized in primate psychology \[[24](https://arxiv.org/html/2605.20423#bib.bib8)\] and later became an important marker in studies of human development \[[33](https://arxiv.org/html/2605.20423#bib.bib9),[3](https://arxiv.org/html/2605.20423#bib.bib10)\]\. In the domain of Artificial Intelligence, evaluating whether computational systems possess this capability has moved from philosophical inquiry into an empirical benchmark of social intelligence\. Early neural evaluations of ToM focused on simple story\-comprehension tasks, suggesting that social intelligence could emerge as a side\-effect of large\-scale language modeling \[[7](https://arxiv.org/html/2605.20423#bib.bib12)\]\. These initial benchmarks mainly used the ToMi dataset \[[19](https://arxiv.org/html/2605.20423#bib.bib7)\], which parameterized the classic Sally\-Anne false\-belief test into simple, linear text vignettes\. While these early evaluations suggested rising competence in models like GPT\-3, later studies showed that such performance was fragile\. Models often relied on shallow heuristics, spurious correlations, and narrative pattern matching rather than on a stable internal model of causal belief states \[[18](https://arxiv.org/html/2605.20423#bib.bib5)\]\. In particular, targeted behavioral studies found that small textual changes that a human reader would handle easily could cause large performance drops in state\-of\-the\-art models \[[32](https://arxiv.org/html/2605.20423#bib.bib14)\]\. This discrepancy between perceived fluency and actual cognitive modeling led to a broader critique in the field \[[26](https://arxiv.org/html/2605.20423#bib.bib6),[6](https://arxiv.org/html/2605.20423#bib.bib13)\], highlighting a persistent "reasoning gap" that required a shift toward more complex, multi\-layered reasoning benchmarks\.

These findings motivated harder benchmarks for higher\-order and recursive belief reasoning\. Studies such as Hi\-ToM \[[37](https://arxiv.org/html/2605.20423#bib.bib3)\] and BigToM \[[12](https://arxiv.org/html/2605.20423#bib.bib4)\] aimed to test second\- and third\-order recursive beliefs and to separate true social reasoning from generalized factual recall errors\. Other work, including ToMChallenges \[[20](https://arxiv.org/html/2605.20423#bib.bib11)\], has also emphasized that small changes in task wording, belief order, and information access can strongly affect measured ToM performance\. This matters for OSCToM because observer\-self conflict depends on the exact relation between what an agent knows, what the agent believes, and what the agent thinks another character believes\. At the same time, large\-scale behavioral evaluations compared LLM performance directly against human baselines \[[4](https://arxiv.org/html/2605.20423#bib.bib15)\], while multi\-order assessments examined the limits of recursive depth \[[30](https://arxiv.org/html/2605.20423#bib.bib16)\]\. The results were mixed\. LLMs sometimes matched adult human performance on fixed behavioral tasks, but they failed unpredictably when the task structure changed, suggesting weak general logical grounding \[[29](https://arxiv.org/html/2605.20423#bib.bib17)\]\. To address the limitations inherent in passive observation tasks, benchmarking efforts increasingly shifted toward scenarios involving information asymmetry and dynamic, interactive contexts\. The introduction of the FANToM benchmark \[[36](https://arxiv.org/html/2605.20423#bib.bib2)\] was an important step in this regard, testing models in multi\-party conversational settings where characters possess unequal access to ground\-truth information\. This change showed that even models excelling at static 1st\-order scenarios drop sharply when forced to continuously update and track "who knows what" across an evolving dialogue \[[9](https://arxiv.org/html/2605.20423#bib.bib18)\]\.

Alongside benchmark design, researchers have studied whether these successes and failures correspond to internal model representations\. Recent interpretability work using techniques such as linear probing on hidden activations has provided early evidence that models form explicit internal representations of belief states for both self and others \[[38](https://arxiv.org/html/2605.20423#bib.bib19)\]\. This creates an important distinction\. A model may encode belief information internally, but it may still fail to use that information during complex inference or adversarial text\-based reasoning\. Seeking broader evaluation contexts beyond text, researchers have also introduced OpenToM \[[34](https://arxiv.org/html/2605.20423#bib.bib20)\], which emphasizes multi\-modal and comprehensive video\-based ToM tracking, further showing that models struggle to maintain mental\-state coherence when information is distributed across complex or novel contexts\. In addition, specialized benchmarks like NegotiationToM \[[8](https://arxiv.org/html/2605.20423#bib.bib21)\] have shown that when models are required to use ToM strategically, acting on inferred beliefs to achieve a goal rather than passively answering a question, their cognitive models often collapse\. Taken together, these findings suggest that LLMs may contain some static reasoning components, but dynamic belief tracking in adversarial or socially complex settings remains difficult\.

For this reason, recent ToM evaluation has moved toward adversarial and programmatic data generation\. The ExploreToM framework \[[28](https://arxiv.org/html/2605.20423#bib.bib1)\] is a leading example in this area, using a domain\-specific language \(DSL\) guided by heuristic A\* search to synthesize complex adversarial stories that programmatically manipulate informational access\. By actively searching for scenarios that violate a model’s existing heuristics, ExploreToM successfully revealed large performance drops in models like Llama\-3\-70B\. However, heuristic\-based generation has computational scaling limits and a rigid search space, particularly when attempting to construct logically sound scenarios beyond 3rd\-order recursive depth\. More importantly, the existing adversarial literature focuses mainly on external perspective\-taking, and leaves out the concept of Observer\-Self Conflict, a state where an agent’s recursive attribution of another’s belief directly contradicts their own internal factual knowledge\. Our work addresses this gap by replacing rigid heuristic search with a policy\-driven Reinforcement Learning \(RL\) approach and a compositional surrogate evaluation pipeline\. This allows us to model and generate Observer\-Self Conflicts at scale, pushing ToM evaluation beyond linear tracing and toward human\-like cognitive conflict\.

A related line of work studies how training data can shape reasoning behavior after pretraining\. Curriculum learning \[[5](https://arxiv.org/html/2605.20423#bib.bib28)\] is relevant here because high\-order ToM tasks are not uniform in difficulty\. A model that cannot reliably solve first\-order false\-belief cases is unlikely to handle third\- or fourth\-order nested beliefs in a stable way\. This is why recent work on reasoning benchmarks often separates simpler belief tracking from tasks that require recursive updates across several agents\. For OSCToM, this observation motivates a staged training design: the model first sees lower\-order belief conflicts and then moves to harder observer\-self cases\. This design follows the broader idea that reasoning ability can improve when examples are ordered by difficulty rather than presented as a single mixed dataset\. The quality of generated data matters as well\. If a generated story becomes inconsistent, then a wrong model answer may reflect confusion in the data rather than a real failure of ToM reasoning\. This makes verification an important part of adversarial ToM generation\.

These points clarify the position of OSCToM relative to prior work\. Existing benchmarks show that LLMs struggle with recursive belief tracking, information asymmetry, and strategic use of mental\-state information\. Programmatic methods show that harder cases can be generated in a controlled way\. OSCToM combines these directions by generating examples that are both adversarial and tied to a specific cognitive structure: the conflict between an observer’s internal belief and the belief that the observer assigns to another agent\. This focus is narrower than general social reasoning, but it allows the method to target a clear and difficult failure mode\.

## 3 Methodology

OSCToM has four main components\. First, we extend a domain\-specific language so it can express Observer\-Self Conflict\. Second, we train lightweight surrogate evaluators from LLM\-distilled difficulty annotations\. These evaluators provide a low\-cost estimate of narrative hardness\. Third, a reinforcement learning agent uses the surrogate reward to search the DSL state space and select action sequences that create adversarial stories\. Fourth, the generated dataset is used to fine\-tune Llama\-3\.1\-8B\-Instruct through a two\-stage curriculum that gradually introduces higher orders of recursive belief conflict\. Figure [2](https://arxiv.org/html/2605.20423#S3.F2) shows the full pipeline\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/osctom_workflow.png) Figure 2: The 8\-stage end\-to\-end OSCT verification and training workflow, mapping the progression from initial surrogate distillation through to final model derivation\.

### 3\.1 Extended Domain\-Specific Language

To move beyond the linear information flow of existing benchmarks like ExploreToM \[[28](https://arxiv.org/html/2605.20423#bib.bib1)\], we introduce the OSCT\-DSL\. This language formalizes "Observer\-Self Conflict," a state where an agent’s recursive model of another’s belief is in direct tension with their own factual knowledge\. By expanding the state space to a 4th\-order hierarchy, OSCT\-DSL allows us to programmatically verify complex scenarios that were previously difficult to track\.

We define a world state $\mathcal{W}$ and a belief state $B$ for any agent $i$ over a proposition $p$\. While first\-order beliefs $B_{i}^{(1)}(p)$ represent an agent’s internal model of reality, we formalize higher\-order recursive beliefs as:

$$B_{i,j,\dots,n}^{(k)}(p) = B_{i}^{(1)}\left(B_{j,\dots,n}^{(k-1)}(p)\right) \tag{1}$$

where $k \in \{2,3,4\}$ denotes the order of recursion\. This allows our framework to synthesize deep information asymmetries\.
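As a worked instance of Equation (1), consider three hypothetical agents $a$, $b$, and $c$ (names chosen here only for illustration). Applying the recursion twice unrolls a third-order belief into nested first-order beliefs:

$$B_{a,b,c}^{(3)}(p) = B_{a}^{(1)}\left(B_{b,c}^{(2)}(p)\right) = B_{a}^{(1)}\left(B_{b}^{(1)}\left(B_{c}^{(1)}(p)\right)\right)$$

This reads as "$a$ believes that $b$ believes that $c$ believes $p$". The number of agents in the subscript equals the order $k$.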

To generate these high\-order conflicts, we move beyond simple physical movements by implementing a set of adversarial primitive actions designed to break informational symmetry\. These include Deceptive Localization \(lie\_about\_location\), where an agent communicates a false proposition $p'$ to another, creating a first\-order false belief $B_{j}^{(1)}(p')$ while the liar’s internal model remains grounded in the true world state\. Another operator is Asymmetric Observation \(one\_way\_mirror\), which allows an agent to witness an action without the observed party gaining a reciprocal second\-order belief\. This breaks the shared\-experience assumption common in simpler ToM benchmarks\. Finally, Recursive Deception \(double\_bluff\) is a composite action where one agent manipulates another into passing a lie to a third party, layering a third\-order recursive belief on top of an initial first\-order falsehood\.
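The effect of a deceptive operator on nested beliefs can be illustrated with a small toy model. This is a minimal sketch under stated assumptions, not the OSCT-DSL implementation: the `ToyWorld` class, its method signatures, and the agent and object names are all hypothetical, and only the operator name `lie_about_location` comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# A belief chain ("Anne", "Bob") reads "what Anne believes Bob believes".
# Its length is the belief order k.
BeliefChain = Tuple[str, ...]


@dataclass
class ToyWorld:
    """Hypothetical toy state; not the OSCT-DSL implementation."""

    location: Dict[str, str] = field(default_factory=dict)  # true world state
    beliefs: Dict[Tuple[BeliefChain, str], str] = field(default_factory=dict)

    def believe(self, chain: BeliefChain, obj: str, place: str) -> None:
        """Record that the chain of agents believes obj is at place."""
        self.beliefs[(chain, obj)] = place

    def query(self, chain: BeliefChain, obj: str) -> Optional[str]:
        """Return the recorded belief for this chain, or None if untracked."""
        return self.beliefs.get((chain, obj))

    def lie_about_location(self, liar: str, target: str, obj: str,
                           false_place: str) -> None:
        """The liar tells the target a false location; the liar's own belief is untouched."""
        self.believe((target,), obj, false_place)       # first-order false belief
        self.believe((liar, target), obj, false_place)  # liar's model of the target

    def observer_self_conflict(self, observer: str, other: str, obj: str) -> bool:
        """True if the observer's own belief differs from the one it attributes to other."""
        own = self.query((observer,), obj)
        attributed = self.query((observer, other), obj)
        return own is not None and attributed is not None and own != attributed


world = ToyWorld(location={"key": "drawer"})
world.believe(("Anne",), "key", "drawer")
world.believe(("Bob",), "key", "drawer")
world.lie_about_location("Anne", "Bob", "key", "shelf")
```

After the lie, Anne still believes the key is in the drawer while attributing the shelf location to Bob, so `world.observer_self_conflict("Anne", "Bob", "key")` holds. Keying beliefs by a tuple of agents is one simple way to represent orders up to four; a real DSL would also need to propagate observations and update chains over time.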

### 3\.2Compositional Surrogate Model Training

A major difficulty in applying Reinforcement Learning to natural language generation is the cost of computing rewards\. During training, the RL agent must evaluate thousands of candidate story trajectories\. Each trajectory requires a judgment about cognitive complexity\. If every judgment is made by a large LLM, each query can require about 70 billion floating\-point operations\. This makes the optimization loop too expensive and prevents the agent from receiving a dense reward signal for policy improvement\.

To reduce this cost, we implement the Compositional Surrogate Evaluation Pipeline\. We use Knowledge Distillation \[[15](https://arxiv.org/html/2605.20423#bib.bib37)\] to transfer the evaluation behavior of Llama\-3\.3\-70B \[[10](https://arxiv.org/html/2605.20423#bib.bib23)\] into six compact DistilBERT \[[25](https://arxiv.org/html/2605.20423#bib.bib22)\] student modules trained on 10,000 annotated stories\. DistilBERT is used as the student backbone because it retains 97% of BERT’s language understanding performance with 40% fewer parameters\. Its small size also allows all six modules to be loaded together as a GPU inference ensemble, avoiding model\-swap overhead during RL training\.

The reward is divided into six independent cognitive dimensions instead of being reduced to one scalar score\. This choice reflects the fact that ToM reasoning involves different processes \[[11](https://arxiv.org/html/2605.20423#bib.bib41)\]\. For example, one story may contain many deceptive actions but still have shallow recursive depth, while another may reach deep recursion with little deception\. These two cases should guide the generator in different ways\. Separate outputs therefore give the agent a more structured and interpretable reward signal\. The six modules are defined as follows:

- False Belief Detector: A binary classifier identifying discrepancies between an agent’s internal belief and the ground\-truth world state\. Its presence is necessary: any story without a false belief is by definition not a Theory of Mind scenario \[[33](https://arxiv.org/html/2605.20423#bib.bib9)\]\.
- ToM Depth Classifier: A 4\-class estimator classifying the maximum order of recursive belief embedding present \(1st through 4th order\)\. It is given the second\-highest reward weight, as recursive depth is the most direct indicator of adversarial difficulty in OSCToM scenarios\.
- Deception Scorer: A continuous scorer measuring the density of manipulative operators \(lies, double\-bluffs, fake memory implants\) normalized by story length\. Normalization prevents the agent from exploiting artificially extended narratives as a reward hack\.
- Social Complexity Scorer: Measures the frequency of inter\-agent communication events relative to the active cast of agents\. Observer\-Self Conflict states cannot arise without rich multi\-agent interaction; this module ensures the generator is encouraged to construct dense social graphs\.
- Temporal Complexity Scorer: Measures the number of temporally distinct world\-state transitions relevant to a target agent’s final belief state\. This directly addresses the well\-documented failure of language models to track chronologically ordered belief changes \[[37](https://arxiv.org/html/2605.20423#bib.bib3)\]\.
- OSCT Detector: A binary classifier with continuous confidence scoring that specifically detects Observer\-Self Conflict states: the condition where an observer’s recursive belief model directly contradicts their own first\-person knowledge of reality\. As the primary optimization target of this work, it receives the highest single reward weight in the composite function\.

The composite reward signal is formulated as a weighted linear combination of five continuous surrogate outputs $S_{k} \in [0,1]$, while the False Belief Detector is used as a hard validity constraint to reject non\-ToM stories:

$$R(\text{Story}) = 0.40 \cdot S_{\text{osct}} + 0.30 \cdot S_{\text{depth}} + 0.15 \cdot S_{\text{dec}} + 0.075 \cdot S_{\text{soc}} + 0.075 \cdot S_{\text{temp}} \tag{2}$$

The weights give priority to OSCT Detection and ToM Depth, which together account for 70% of the reward\. This keeps the policy focused on high\-order Observer\-Self Conflict states\. The remaining 30% is assigned to deception density, social complexity, and temporal complexity\. These terms reduce the chance that the agent exploits only one reward dimension and help preserve narrative coherence\. With this pipeline, per\-story evaluation drops from about 14 seconds for an LLM query to less than 50 milliseconds with surrogate inference\. This speedup makes it practical to construct a 15,000\-sample adversarial corpus\.
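The composite reward in Equation (2) can be sketched in a few lines. The weights are taken from the equation; the function name, the dictionary keys, and the choice to return zero reward for a story that fails the False Belief gate are assumptions made for this illustration, since the text only says such stories are rejected.

```python
from typing import Dict

# Weights from Equation (2); they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "osct": 0.40,
    "depth": 0.30,
    "dec": 0.15,
    "soc": 0.075,
    "temp": 0.075,
}


def composite_reward(scores: Dict[str, float], has_false_belief: bool) -> float:
    """Weighted sum of the five continuous surrogate scores, gated by validity."""
    if not has_false_belief:
        # Assumption: a rejected (non-ToM) story simply earns zero reward.
        return 0.0
    for name in WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score {name!r} must lie in [0, 1]")
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


# Hypothetical surrogate outputs for one story.
example = {"osct": 1.0, "depth": 0.75, "dec": 0.2, "soc": 0.4, "temp": 0.4}
reward = composite_reward(example, has_false_belief=True)
rejected = composite_reward(example, has_false_belief=False)
```

Because every score lies in $[0,1]$ and the weights sum to one, the reward is also bounded in $[0,1]$, which keeps it comparable across stories of different length.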

### 3\.3 Surrogate\-Guided RL Training

We model adversarial story generation as a Markov Decision Process \(MDP\), defined by the tuple $(\mathcal{S},\mathcal{A},P,R,\gamma)$\. The state and action spaces come from the Extended DSL\. The main challenge is to find sequences of cognitive and social operators, such as double\_bluff or asymmetric observation, that increase both narrative conflict and recursive depth\. To choose the generation policy, we compared several reinforcement learning methods: Advantage Actor\-Critic \(A2C\) \[[22](https://arxiv.org/html/2605.20423#bib.bib24)\], Proximal Policy Optimization \(PPO\) \[[27](https://arxiv.org/html/2605.20423#bib.bib27)\], and Deep Q\-Networks \(DQN\) \[[23](https://arxiv.org/html/2605.20423#bib.bib25)\]\.

Table [1](https://arxiv.org/html/2605.20423#S3.T1) summarizes the comparison\. We selected Deep Q\-Networks \(DQN\) as the main generation policy because it learned the discrete symbolic transitions of the OSCT\-DSL more efficiently\. Policy\-gradient methods such as A2C and Recurrent PPO were less stable during reward convergence\. In contrast, the DQN generator maintained more stable value estimates for DSL states\. This made it better suited for finding ToM failure cases that heuristic search methods may overlook \[[28](https://arxiv.org/html/2605.20423#bib.bib1)\]\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/mean_reward.png) Figure 3: Learning curve for the optimized DQN generator\. The vertical axis represents the mean episodic reward derived from the surrogate ensemble across 200,000 timesteps of adversarial training\.

Figure [3](https://arxiv.org/html/2605.20423#S3.F3) shows the training progress of our optimized DQN architecture\. Using a replay buffer of 100,000 transitions and a target update interval of 500 steps, the model converged stably and consistently produced complex 4th\-order recursive scenarios\. As shown in the performance summary, the DQN\-based generator achieved the best balance between mean narrative hardness \(0\.515\) and training efficiency \(16m 37s\), making it the most suitable engine for our large\-scale data synthesis pipeline\.
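The two DQN hyperparameters quoted above control how experience is stored and how often the target network is refreshed. The following is a minimal sketch of that bookkeeping only, assuming a generic DQN loop; the class and function names are hypothetical, the Q-network itself is omitted, and nothing here is taken from the project code.

```python
import random
from collections import deque
from typing import Any, List, Tuple

BUFFER_SIZE = 100_000          # replay buffer capacity reported in the text
TARGET_UPDATE_INTERVAL = 500   # target-network sync period reported in the text
TOTAL_TIMESTEPS = 200_000      # training length reported in the Figure 3 caption

Transition = Tuple[Any, ...]   # e.g. (state, action, reward, next_state, done)


class ReplayBuffer:
    """Fixed-capacity FIFO store; the oldest transitions are evicted first."""

    def __init__(self, capacity: int = BUFFER_SIZE) -> None:
        self._storage: deque = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        self._storage.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        """Draw a uniform random minibatch without replacement."""
        return rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


def should_sync_target(step: int) -> bool:
    """True on the steps where online weights would be copied to the target network."""
    return step > 0 and step % TARGET_UPDATE_INTERVAL == 0


# Small demonstration with a tiny capacity so eviction is visible.
demo = ReplayBuffer(capacity=3)
for i in range(5):
    demo.add((i,))
num_syncs = sum(should_sync_target(s) for s in range(1, TOTAL_TIMESTEPS + 1))
```

With these settings the target network would be refreshed 400 times over the 200,000 training steps, and the buffer holds at most half of all transitions seen, so early low-reward rollouts eventually age out.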

Table 1: Comparative performance of reinforcement learning architectures\. DQN was chosen as the primary generator for the OSCToM framework due to its superior stability and output quality\.

### 3\.4 Adversarial Dataset Generation

The OSCT corpus is produced through a four\-stage pipeline that combines symbolic trajectory optimization with LLM\-based narrative enhancement\. At the start of each episode, ToMStoryEnv samples a context from predefined agent names, room layouts, and object inventories\. The trained DQN policy then rolls out over 15 discrete DSL operations\. It selects physical and cognitive actions, including lie\_about\_location, double\_bluff, one\_way\_mirror\_observation, and fake\_memory\_implant, to maximize the composite hardness reward\. At each timestep, the policy receives a 256\-dimensional observation vector encoding the normalized story length, active agent and object counts, and the surrogate module scores from the preceding episode\.

The terminal reward is computed by the compositional surrogate ensemble across five continuous cognitive dimensions\. The False Belief Detector acts as a validity gate\. The weights emphasize Observer\-Self Conflict detection and recursive depth:

$$H = 0.40 \cdot S_{\text{osct}} + 0.30 \cdot S_{\text{depth}} + 0.15 \cdot S_{\text{dec}} + 0.075 \cdot S_{\text{soc}} + 0.075 \cdot S_{\text{temp}}, \quad H \in [0,1] \tag{3}$$

A three\-phase curriculum scheduler changes the relative weights of hardness, diversity, and validity during training\. Early episodes emphasize structural validity\. Later episodes place more weight on adversarial hardness\. The final symbolic script is passed to Llama\-3\.3\-70B \[[10](https://arxiv.org/html/2605.20423#bib.bib23)\] through the OpenRouter API\. This step converts the DSL trace into natural prose while preserving the intended cognitive markers, including deception chains, information asymmetry, and non\-linear temporal ordering\. The target overall\_hardness score is set above 0\.85\.

The ToMQuestionGenerator then reads the internal belief dictionaries in the ExtendedToMDSL state\. It extracts up to five question\-answer pairs per story for each recursion order, from first\-order to fourth\-order belief queries\. After generation, each story receives a difficulty label from one of five tiers\. The tiers are assigned using percentile thresholds \($P_{20}, P_{40}, P_{60}, P_{80}$\) computed from the final hardness\-score distribution\. Figures [4](https://arxiv.org/html/2605.20423#S3.F4), [5](https://arxiv.org/html/2605.20423#S3.F5), and [6](https://arxiv.org/html/2605.20423#S3.F6) show the resulting difficulty distribution and surrogate scores for the 15,000\-sample corpus\.
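The percentile-based tiering can be sketched with the Python standard library. This is an illustration under assumptions: the function name is hypothetical, the synthetic scores stand in for real hardness values, and sending a score that equals a threshold to the higher tier is a choice made here, not something the text specifies.

```python
import bisect
import statistics
from typing import List, Sequence


def assign_tiers(hardness: Sequence[float]) -> List[int]:
    """Label each score with a tier from 1 (easiest) to 5 (hardest)."""
    # With n=5, statistics.quantiles returns four cut points: P20, P40, P60, P80.
    cuts = statistics.quantiles(hardness, n=5, method="inclusive")
    # bisect_right counts how many thresholds are <= the score (0..4).
    return [bisect.bisect_right(cuts, h) + 1 for h in hardness]


# Synthetic, evenly spaced hardness scores in [0, 1] for demonstration only.
scores = [i / 100 for i in range(101)]
tiers = assign_tiers(scores)
```

Because the thresholds come from the corpus itself, each tier holds roughly a fifth of the stories regardless of how the raw hardness scores are distributed.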

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/difficulty_distribution.png) Figure 4: Difficulty tier distribution across the OSCT corpus\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/hardness_histogram.png) Figure 5: Histogram of aggregate hardness scores across the 15,000\-sample corpus\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/module_complexity.png) Figure 6: Per\-dimension surrogate complexity scores across the six cognitive modules\.

### 3\.5 Two\-Stage Curriculum Fine\-Tuning

The final component of OSCToM is supervised fine\-tuning of Llama\-3\.1\-8B\-Instruct \[[10](https://arxiv.org/html/2605.20423#bib.bib23)\] on the OSCT corpus\. We do not train on the full dataset in one pass\. Instead, we use a two\-stage curriculum strategy \[[5](https://arxiv.org/html/2605.20423#bib.bib28)\], implemented in the CurriculumTrainer class\. The motivation is simple: high\-order reasoning is much harder than low\-order false\-belief reasoning\. If the model sees 4th\-order Observer\-Self Conflict examples too early, training becomes less stable and generalization decreases\.

In Stage 1 \(Foundation\), the model is trained only on stories containing questions at the 1st and 2nd order of Theory of Mind recursion\. These stories are filtered automatically from the full OSCT dataset by selecting entries where at least one associated question carries a tom\_order value of 2 or below\. This stage grounds the model in basic perspective\-taking and false\-belief resolution before deceptive or asymmetric constructs are introduced\. In Stage 2 \(Mastery\), the full OSCT corpus is used, including the complete range of 3rd and 4th\-order recursive scenarios\. The checkpoint produced by Stage 1 is used as the initialization point for Stage 2, ensuring that foundational belief\-tracking capabilities are preserved rather than overwritten during the higher\-order optimization\.
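The Stage 1 filter described above can be sketched as a simple selection over the corpus. The record layout (a `questions` list per story, an `id` field) and the function names are hypothetical; only the `tom_order` field and the "at least one question of order 2 or below" rule come from the text.

```python
from typing import Any, Dict, List

Story = Dict[str, Any]


def stage1_subset(corpus: List[Story]) -> List[Story]:
    """Keep stories with at least one question of ToM order 2 or below."""
    return [
        story
        for story in corpus
        if any(q["tom_order"] <= 2 for q in story["questions"])
    ]


def stage2_subset(corpus: List[Story]) -> List[Story]:
    """Stage 2 trains on the full corpus, including 3rd- and 4th-order stories."""
    return list(corpus)


# Tiny hypothetical corpus for demonstration.
corpus = [
    {"id": "s1", "questions": [{"tom_order": 1}, {"tom_order": 4}]},
    {"id": "s2", "questions": [{"tom_order": 3}, {"tom_order": 4}]},
    {"id": "s3", "questions": [{"tom_order": 2}]},
]
stage1 = stage1_subset(corpus)
stage2 = stage2_subset(corpus)
```

Note that under this rule a Stage 1 story such as `s1` may still carry higher-order questions alongside its low-order ones; whether those questions are also dropped in Stage 1 is not stated in the text.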

Parameter\-efficient fine\-tuning is performed through Low\-Rank Adaptation \(LoRA\) \[[16](https://arxiv.org/html/2605.20423#bib.bib31)\], applied across all seven projection layers of the attention and feed\-forward blocks \(q\_proj, k\_proj, v\_proj, o\_proj, gate\_proj, up\_proj, down\_proj\)\. The adapter is configured with rank $r=16$, scaling factor $\alpha=32$, and dropout $0.05$\. The base model is loaded in 4\-bit quantization via the Unsloth optimization framework \[[14](https://arxiv.org/html/2605.20423#bib.bib40)\], allowing training within a single\-GPU budget\. Each stage uses a per\-device batch size of 2 with 8 gradient accumulation steps \(effective batch size of 16\), a linear learning rate scheduler, and the AdamW\-8bit optimizer\. Training instances are formatted as structured prompts of the form Story → Question → Answer, with a maximum sequence length of 1,024 tokens\. The training loss across both curriculum stages is shown in Figure [7](https://arxiv.org/html/2605.20423#S3.F7)\.
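The reported settings can be gathered into a small configuration sketch. The numeric values and module names are taken from the text; the `StageConfig` class, its field names, and the exact prompt template in `format_example` are hypothetical, since the paper only states the Story → Question → Answer ordering. The sketch deliberately avoids calling any fine-tuning library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters reported for each curriculum stage (container is illustrative)."""
    lora_rank: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    )
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 8
    max_seq_length: int = 1024

    @property
    def effective_batch_size(self):
        # Single-GPU setting: examples per optimizer step.
        return self.per_device_batch_size * self.gradient_accumulation_steps


def format_example(story, question, answer):
    """Assumed prompt layout following the Story -> Question -> Answer order."""
    return f"Story: {story}\nQuestion: {question}\nAnswer: {answer}"


cfg = StageConfig()
prompt = format_example("Ana hides the key.", "Where does Ben think the key is?", "In the drawer")
```

With these values the LoRA update is scaled by $\alpha / r = 2$, and each optimizer step sees $2 \times 8 = 16$ examples.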

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/training_loss_curve.png) Figure 7: Training loss across both curriculum stages of the OSCToM\-8B fine\-tuning pipeline\. Stage 1 optimizes on 1st and 2nd\-order ToM scenarios, while Stage 2 extends training to the full adversarial OSCT corpus containing 3rd and 4th\-order recursive conflicts\.

## 4 Results and Evaluation

We evaluate OSCToM\-8B on benchmark accuracy and inference efficiency\. This distinction is important because ExploreToM \[[28](https://arxiv.org/html/2605.20423#bib.bib1)\] reports results from iterative A\* heuristic search rather than from direct model inference\. That procedure has an average latency of 15\.0 seconds per query and inference complexity $\mathcal{O}(N)$\. OSCToM\-8B uses single\-pass neural inference\. Its measured latency is 2\.62 seconds, giving a 5\.7x reduction in response time while exceeding ExploreToM’s FANToM accuracy by 75\.8 percentage points \(76\.0% versus 0\.2%\)\.
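The two headline comparisons follow directly from the reported figures. The short check below, with illustrative constant and function names, separates the latency ratio from the accuracy gap, and the absolute gap in percentage points from the relative ratio.

```python
EXPLORETOM_LATENCY_S = 15.0   # reported average latency per query (A* search)
OSCTOM_LATENCY_S = 2.62       # measured single-pass latency per query
EXPLORETOM_FANTOM_ACC = 0.2   # FANToM accuracy in percent, as reported
OSCTOM_FANTOM_ACC = 76.0      # FANToM accuracy in percent


def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def percentage_point_gap(candidate_pct, baseline_pct):
    """Absolute difference between two percentages."""
    return candidate_pct - baseline_pct


latency_speedup = speedup(EXPLORETOM_LATENCY_S, OSCTOM_LATENCY_S)
fantom_gap_pp = percentage_point_gap(OSCTOM_FANTOM_ACC, EXPLORETOM_FANTOM_ACC)
fantom_ratio = OSCTOM_FANTOM_ACC / EXPLORETOM_FANTOM_ACC

print(f"{latency_speedup:.1f}x faster, +{fantom_gap_pp:.1f} pp, {fantom_ratio:.0f}x relative accuracy")
```

The absolute gap is about 75.8 percentage points, while the relative ratio is about 380 to 1; the two quantities should not be conflated.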

### 4\.1 Benchmark Accuracy

We evaluate OSCToM\-8B against seven baseline systems across four Theory of Mind benchmarks: ToMi \[[19](https://arxiv.org/html/2605.20423#bib.bib7)\], Hi\-ToM \[[37](https://arxiv.org/html/2605.20423#bib.bib3)\], BigToM \[[12](https://arxiv.org/html/2605.20423#bib.bib4)\], and FANToM \[[36](https://arxiv.org/html/2605.20423#bib.bib2)\]\. The baselines include Llama\-3\.1\-8B\-Base \[[10](https://arxiv.org/html/2605.20423#bib.bib23)\], Mistral\-NeMo\-12B \[[21](https://arxiv.org/html/2605.20423#bib.bib32)\], Phi\-3\-Medium\-14B \[[1](https://arxiv.org/html/2605.20423#bib.bib33)\], Qwen2\.5\-14B and Qwen2\.5\-32B \[[35](https://arxiv.org/html/2605.20423#bib.bib34)\], Gemma\-2\-27B \[[13](https://arxiv.org/html/2605.20423#bib.bib35)\], and the reported results of ExploreToM \[[28](https://arxiv.org/html/2605.20423#bib.bib1)\]\. Full results are shown in Table [2](https://arxiv.org/html/2605.20423#S4.T2)\.

Table 2: Accuracy \(%\) comparison across four Theory of Mind benchmarks\. Bold denotes the best result per column\. † denotes results as reported in the original paper, produced via iterative A\* heuristic search rather than direct model inference\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/model_comparison_bars.png) Figure 8: Bar chart comparison of benchmark accuracy across all evaluated models on ToMi, Hi\-ToM, BigToM, and FANToM\. OSCToM\-8B achieves the best overall accuracy profile despite its smaller parameter count\.

OSCToM\-8B gives the highest accuracy on ToMi \(79\.5%\) and FANToM \(76\.0%\)\. It also exceeds several models with three to four times more parameters\. FANToM is especially important because it uses multi\-party conversations in which characters have asymmetric access to the true state of the world \[[36](https://arxiv.org/html/2605.20423#bib.bib2)\]\. A model must update nested belief states as the dialogue changes\. ExploreToM reports 0\.2% accuracy on FANToM, while OSCToM\-8B reaches 76\.0%\. This difference suggests a limitation of A\*\-guided heuristic synthesis: optimizing for the amount of information does not necessarily produce training data that teaches the model to reason about divergent nested beliefs in dynamic conversations\. Figure [8](https://arxiv.org/html/2605.20423#S4.F8) gives the visual comparison across all models\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/efficiency_scatter.png) Figure 9: Scatter plot of model size \(parameters\) versus mean benchmark accuracy\. OSCToM\-8B shows the best accuracy\-to\-size ratio among all evaluated models\.

On BigToM, OSCToM\-8B reaches 89\.8%, close to Mistral\-NeMo\-12B at 90\.5%, despite having 4 billion fewer parameters\. On Hi\-ToM, OSCToM\-8B reaches 65\.3%, matching Phi\-3\-Medium\-14B and exceeding Gemma\-2\-27B \(20\.6%\) and Mistral\-NeMo\-12B \(25\.6%\)\. These results support the value of the two\-stage curriculum, which exposes the model to recursive belief conflict in increasing order of difficulty\.

### 4\.2 Inference Efficiency

The second dimension of the evaluation is inference efficiency\. Current ToM systems such as ExploreToM rely on iterative A\* search, which requires multiple sequential LLM calls per query to construct and validate an answer\. This procedure has an inference complexity of $\mathcal{O}(N)$, where $N$ is the number of search steps, and results in an average response latency of 15\.0 seconds per query\. OSCToM\-8B instead answers by direct neural inference \(a single forward pass through the fine\-tuned model\), which gives $\mathcal{O}(1)$ inference complexity with respect to search steps and a measured average latency of 2\.62 seconds\. This architectural difference is summarized in Table [3](https://arxiv.org/html/2605.20423#S4.T3)\.

Table 3: Inference characteristics of ExploreToM versus OSCToM\-8B\. The A\* search method used by ExploreToM is iterative and hardware\-intensive; OSCToM\-8B performs constant\-time single\-pass inference\.

Figures [10](https://arxiv.org/html/2605.20423#S4.F10) and [9](https://arxiv.org/html/2605.20423#S4.F9) summarize the efficiency results\. Figure [10](https://arxiv.org/html/2605.20423#S4.F10) compares throughput with benchmark accuracy and shows that OSCToM\-8B avoids the iterative overhead of search\-based methods\. Figure [9](https://arxiv.org/html/2605.20423#S4.F9) compares model size with aggregate benchmark accuracy\. In this view, OSCToM\-8B falls in the best region, with the highest combined accuracy per parameter count among the evaluated systems\. These results suggest that adversarial training data and curriculum fine\-tuning can improve both capability and computational practicality\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/inference_efficiency.png) Figure 10: Inference throughput comparison across models\. OSCToM\-8B achieves a 5\.7x reduction in average latency relative to ExploreToM’s A\* search procedure while maintaining superior accuracy\.

### 4\.3 Discussion and Limitations

The results indicate that OSCToM is most useful when a task requires a model to keep track of several belief states at the same time\. This is different from simple false\-belief tests, where the main difficulty is often to identify what one character does or does not know\. In Observer\-Self Conflict scenarios, the model must also separate an observer’s own belief from the belief that the observer attributes to another agent\. This separation is the main reason the generated stories are useful for high\-order ToM training\. It gives the model repeated exposure to cases where the surface narrative is not enough and where the answer depends on the structure of nested mental states\.

There are still limitations\. The current DSL focuses mainly on belief conflicts built from location, observation, communication, and deception events\. These operators are useful for controlled ToM evaluation, but they do not cover all forms of social reasoning\. Human social understanding also includes emotion, uncertainty, memory, intention, trust, and moral judgment\. Another limitation is that the generated stories are text\-only\. In real settings, beliefs may depend on visual cues, gestures, timing, or shared physical context\. Extending OSCToM to those settings would require new state representations and new verification rules\.

Finally, the surrogate pipeline makes generation efficient, but it also introduces a dependency on the quality of the surrogate labels\. If the surrogate modules assign high scores to stories that are complex in form but weak in actual reasoning content, the generator may learn patterns that look difficult without being genuinely useful\. For this reason, future versions of the framework should include stronger human or model\-based audits of generated stories, especially for the highest difficulty tiers\. These limitations do not change the reported results, but they clarify where the framework is strongest and where further validation is needed\.

## 5 Conclusion

This paper introduced OSCToM, a framework for generating and training on adversarial Theory of Mind scenarios based on Observer\-Self Conflict\. This conflict occurs when an agent’s recursive model of another agent’s belief contradicts the agent’s own internal knowledge\. OSCToM combines an Extended DSL, a compositional surrogate pipeline, a DQN\-guided RL generator, and a two\-stage curriculum fine\-tuning strategy for Llama\-3\.1\-8B\-Instruct \[[10](https://arxiv.org/html/2605.20423#bib.bib23)\]\. The resulting OSCToM\-8B model reaches 76% accuracy on FANToM \[[36](https://arxiv.org/html/2605.20423#bib.bib2)\], compared with the 0\.2% reported by ExploreToM \[[28](https://arxiv.org/html/2605.20423#bib.bib1)\]\. It also matches or exceeds models with up to four times more parameters\. In addition, OSCToM\-8B reduces inference latency by 5\.7x relative to A\*\-based pipeline approaches, making it a practical system for high\-order social reasoning\.

Future work will extend the DSL beyond locative belief conflicts to include emotional and epistemic states\. We also plan to develop a dedicated OSCT evaluation benchmark and study multi\-modal settings where belief states are distributed across visual and conversational evidence\[[34](https://arxiv.org/html/2605.20423#bib.bib20)\]\. Overall, these results suggest that principled adversarial data construction can be an efficient path toward more generalizable social intelligence in large language models\.

## References

- \[1\]\(2024\)Phi\-3 technical report: a highly capable language model locally on your phone\.External Links:2404\.14219Cited by:[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.5.3.1)\.
- \[2\]T\. Akiba, S\. Sano, T\. Yanase, T\. Ohta, and M\. Koyama\(2019\)Optuna: a next\-generation hyperparameter optimization framework\.InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,pp\. 2623–2631\.Cited by:[§A\.1](https://arxiv.org/html/2605.20423#A1.SS1.p1.5)\.
- \[3\]S\. Baron\-Cohen, A\. M\. Leslie, and U\. Frith\(1985\)Does the autistic child have a "theory of mind"?\.Cognition21\(1\),pp\. 37–46\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[4\]C\. Becchio, J\. W\. Strachan,et al\.\(2024\)Testing theory of mind in large language models and humans\.Nature Human Behaviour8\(7\),pp\. 1285–1295\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1)\.
- \[5\]Y\. Bengio, J\. Louradour, R\. Collobert, and J\. Weston\(2009\)Curriculum learning\.InProceedings of the 26th Annual International Conference on Machine Learning \(ICML\),pp\. 41–48\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p5.1),[§3\.5](https://arxiv.org/html/2605.20423#S3.SS5.p1.1)\.
- \[6\]M\. Binz and E\. Schulz\(2023\)Using cognitive psychology to understand gpt\-3\.Proceedings of the National Academy of Sciences120\(6\)\.Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p2.1),[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[7\]S\. Bubecket al\.\(2023\)Sparks of artificial general intelligence: early experiments with gpt\-4\.arXiv preprint arXiv:2303\.12712\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[8\]C\. Chanet al\.\(2024\)NegotiationToM: a benchmark for theory of mind in negotiation\.arXiv preprint arXiv:2405\.XXXXX\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p3.1)\.
- \[9\]Z\. Chenet al\.\(2024\)ToMBench: benchmarking theory of mind in large language models\.arXiv preprint arXiv:2402\.15052\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1)\.
- \[10\]A\. Dubey, A\. Jauhri, A\. Pandey, A\. Keneally,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§3\.2](https://arxiv.org/html/2605.20423#S3.SS2.p2.1),[§3\.4](https://arxiv.org/html/2605.20423#S3.SS4.p2.2),[§3\.5](https://arxiv.org/html/2605.20423#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.3.1.1),[§5](https://arxiv.org/html/2605.20423#S5.p1.1)\.
- \[11\]C\. D\. Frith and U\. Frith\(2006\)The neural basis of mentalizing\.Neuron50\(4\),pp\. 531–534\.Cited by:[§3\.2](https://arxiv.org/html/2605.20423#S3.SS2.p3.1)\.
- \[12\]K\. Gandhi, T\. Gerstenberg, and N\. D\. Goodman\(2023\)Understanding social reasoning in language models with bigtom\.arXiv preprint arXiv:2307\.03104\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1)\.
- \[13\]Gemma Team, Google DeepMind\(2024\)Gemma 2: improving open language models at a practical size\.External Links:2408\.00118Cited by:[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.8.6.1)\.
- \[14\]D\. Han and team\(2023\)Unsloth: fine\-tuning llama models 2x faster with 70% less memory\.Note:[https://github\.com/unslothai/unsloth](https://github.com/unslothai/unsloth)Cited by:[§3\.5](https://arxiv.org/html/2605.20423#S3.SS5.p3.5)\.
- \[15\]G\. Hinton, O\. Vinyals, and J\. Dean\(2015\)Distilling the knowledge in a neural network\.arXiv preprint arXiv:1503\.02531\.Cited by:[§3\.2](https://arxiv.org/html/2605.20423#S3.SS2.p2.1)\.
- \[16\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.arXiv preprint arXiv:2106\.09685\.Cited by:[§3\.5](https://arxiv.org/html/2605.20423#S3.SS5.p3.5)\.
- \[17\]F\. Hutter, H\. H\. Hoos, and K\. Leyton\-Brown\(2014\)An efficient approach for assessing hyperparameter importance\.InProceedings of the 31st International Conference on Machine Learning \(ICML\),pp\. 754–762\.Cited by:[Figure 13](https://arxiv.org/html/2605.20423#A1.F13.1)\.
- \[18\]M\. Kosinski\(2023\)Theory of mind may have spontaneously emerged in large language models\.arXiv preprint arXiv:2302\.02083\.Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p2.1),[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[19\]H\. Le, Y\.\-L\. Boureau, and M\. Nickel\(2019\)ToMi: a diagnostic benchmark for theory of mind in text\.arXiv preprint arXiv:1911\.12310\.Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p2.1),[§2](https://arxiv.org/html/2605.20423#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1)\.
- \[20\]Z\. Ma, J\. Sansom, R\. Peng, and J\. Chai\(2023\)ToMChallenges: a principle\-guided dataset and diverse evaluation tasks for exploring theory of mind\.arXiv preprint arXiv:2305\.15068\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1)\.
- \[21\]Mistral AI and NVIDIA\(2024\)Mistral nemo 12b\.Note:[https://mistral\.ai/news/mistral\-nemo/](https://mistral.ai/news/mistral-nemo/)Cited by:[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.4.2.1)\.
- \[22\]V\. Mnih, A\. P\. Badia, M\. Mirza, A\. Graves, T\. Lillicrap, T\. Harley, D\. Silver, and K\. Kavukcuoglu\(2016\)Asynchronous methods for deep reinforcement learning\.InInternational Conference on Machine Learning,pp\. 1928–1937\.Cited by:[§3\.3](https://arxiv.org/html/2605.20423#S3.SS3.p1.1)\.
- \[23\]V\. Mnih, K\. Kavukcuoglu, D\. Silver, A\. A\. Rusu, J\. Veness, M\. G\. Bellemare, A\. Graves, M\. Riedmiller, A\. K\. Fidjeland, G\. Ostrovski,et al\.\(2015\)Human\-level control through deep reinforcement learning\.Nature518\(7540\),pp\. 529–533\.Cited by:[§3\.3](https://arxiv.org/html/2605.20423#S3.SS3.p1.1)\.
- \[24\]D\. Premack and G\. Woodruff\(1978\)Does the chimpanzee have a theory of mind?\.Behavioral and Brain Sciences1\(4\),pp\. 515–526\.Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p1.1),[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[25\]V\. Sanh, L\. Debut, J\. Chaumond, and T\. Wolf\(2019\)DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter\.arXiv preprint arXiv:1910\.01108\.Cited by:[§3\.2](https://arxiv.org/html/2605.20423#S3.SS2.p2.1)\.
- \[26\]M\. Sapet al\.\(2023\)Neural theory of mind? on the limits of social intelligence in large language models\.arXiv preprint arXiv:2310\.13622\.Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p2.1),[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[27\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§3\.3](https://arxiv.org/html/2605.20423#S3.SS3.p1.1)\.
- \[28\]M\. Sclar, Y\. Choi, R\. West, F\. Brahman, and C\. Bhagavatula\(2024\)Explore theory of mind: program\-guided adversarial data generation for theory of mind reasoning\.arXiv preprint arXiv:2410\.02844\.Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p3.1),[§2](https://arxiv.org/html/2605.20423#S2.p4.1),[§3\.1](https://arxiv.org/html/2605.20423#S3.SS1.p1.1),[§3\.3](https://arxiv.org/html/2605.20423#S3.SS3.p2.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.1.1),[Table 3](https://arxiv.org/html/2605.20423#S4.T3.2.3.1.2),[§4](https://arxiv.org/html/2605.20423#S4.p1.1),[§5](https://arxiv.org/html/2605.20423#S5.p1.1)\.
- \[29\]N\. Shapira, O\. Levy,et al\.\(2023\)Clever hans or neural theory of mind? stress testing social reasoning in large language models\.arXiv preprint arXiv:2305\.14763\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1)\.
- \[30\]W\. Streetet al\.\(2024\)LLMs achieve adult human performance on higher\-order theory of mind tasks\.arXiv preprint arXiv:2405\.18870\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1)\.
- \[31\]M\. Towers, A\. Kwiatkowski, J\. Terry, J\. U\. Balis, C\. De Witt, C\. Glöckner, A\. Kallinteris, S\. Ktena, M\. C\. Machado,et al\.\(2024\)Gymnasium: a standard interface for reinforcement learning environments\.arXiv preprint arXiv:2407\.17032\.Cited by:[§A\.2](https://arxiv.org/html/2605.20423#A1.SS2.p1.1)\.
- \[32\]T\. Ullman\(2023\)Large language models fail on trivial alterations to theory\-of\-mind tasks\.arXiv preprint arXiv:2302\.08399\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p1.1)\.
- \[33\]H\. Wimmer and J\. Perner\(1983\)Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception\.Cognition13\(1\),pp\. 103–128\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p1.1),[1st item](https://arxiv.org/html/2605.20423#S3.I1.i1.p1.1)\.
- \[34\]Y\. Xuet al\.\(2024\)OpenToM: a comprehensive video\-based theory of mind benchmark\.arXiv preprint arXiv:2402\.XXXXX\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p3.1),[§5](https://arxiv.org/html/2605.20423#S5.p2.1)\.
- \[35\]A\. Yang, B\. Yang, B\. Hui,et al\.\(2024\)Qwen2\.5 technical report\.External Links:2412\.15115Cited by:[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.6.4.1),[Table 2](https://arxiv.org/html/2605.20423#S4.T2.3.7.5.1)\.
- \[36\]R\. Zheng, S\. Han, Y\. Kan, Z\. Ma, Y\. Zhang, N\. Peng, S\. Xia, Z\. Chen, J\. Gui, Y\. Guan,et al\.\(2023\)FANToM: a benchmark for stress\-testing theory of mind in information\-asymmetric conversations\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Cited by:[§1](https://arxiv.org/html/2605.20423#S1.p6.1),[§2](https://arxiv.org/html/2605.20423#S2.p2.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p2.1),[§5](https://arxiv.org/html/2605.20423#S5.p1.1)\.
- \[37\]J\. Zhu, R\. Zhang, Y\. Wang, Y\. Liu,et al\.\(2023\)Hi\-tom: a higher\-order theory of mind benchmark for large language models\.arXiv preprint arXiv:2310\.16542\.Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p2.1),[5th item](https://arxiv.org/html/2605.20423#S3.I1.i5.p1.1),[§4\.1](https://arxiv.org/html/2605.20423#S4.SS1.p1.1)\.
- \[38\]W\. Zhu, Z\. Zhang, and Y\. Wang\(2024\)Language models represent beliefs of self and others\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§2](https://arxiv.org/html/2605.20423#S2.p3.1)\.

## Appendix A

### A\.1 Hyperparameter Tuning of the DQN Generator

Hyperparameter optimization was performed using the Optuna framework \[[2](https://arxiv.org/html/2605.20423#bib.bib39)\] with a Tree\-structured Parzen Estimator \(TPE\) sampler across 25 trials, each evaluated over 50,000 timesteps and 10 test episodes\. The search spanned 11 parameters including learning rate, buffer size, discount factor $\gamma$, soft update coefficient $\tau$, and the $\varepsilon$\-greedy exploration schedule\. Of the 25 trials, 9 returned $-\infty$ reward due to policy divergence\. Analysis of these failures shows a consistent pattern: all divergent trials combined a small replay buffer \($\leq 10{,}000$ transitions\) with a high learning rate, causing the Q\-function estimates to become unstable before sufficient experience had accumulated\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/optimization_history.png) Figure 11: Optuna optimization history\. The optimal reward of 0\.900 was reached at Trial 4 and confirmed by subsequent trials, indicating successful convergence of the TPE sampler\.

The best configuration \(Trial 4\) specifies a learning rate of $5.95\times 10^{-4}$, a buffer size of $100{,}000$, $\tau=0.019$, $\gamma=0.902$, and a train frequency of 8 steps with 5 gradient updates per step\. The especially low $\tau$ value makes the target network update gradually, which helps limit Q\-value overestimation in the non\-stationary surrogate\-reward environment\. fANOVA importance analysis shows that buffer size is the single most decisive factor in training stability, with all high\-reward trials \($>0.85$\) sharing a buffer of $100{,}000$ regardless of other parameter values\.
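To make the role of the soft update coefficient concrete, the sketch below applies the standard Polyak-style target update $\theta_{\text{target}} \leftarrow (1-\tau)\,\theta_{\text{target}} + \tau\,\theta_{\text{online}}$ with the reported $\tau = 0.019$. The function names and the scalar toy parameters are illustrative assumptions; this is not the generator's actual training loop.

```python
TAU = 0.019  # soft update coefficient reported for the best trial


def soft_update(target_params, online_params, tau=TAU):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def updates_until_gap_halves(tau=TAU):
    """Count updates needed for the target-online gap to fall below one half."""
    target, online = [0.0], [1.0]  # toy one-parameter networks, online held fixed
    steps = 0
    while online[0] - target[0] >= 0.5:
        target = soft_update(target, online, tau)
        steps += 1
    return steps


first_step = soft_update([0.0], [1.0])
half_life = updates_until_gap_halves()
```

With a fixed online network the remaining gap after $k$ updates is $(1-\tau)^k$, so a small $\tau$ means the target network trails the online network by several dozen updates, which is the gradual behavior the paragraph above describes.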

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/param_importances.png) Figure 12: fANOVA\-estimated hyperparameter importance \[[17](https://arxiv.org/html/2605.20423#bib.bib38)\]\. Buffer size and learning rate are dominant\.
![Refer to caption](https://arxiv.org/html/2605.20423v1/images/parallel_coordinate.png) Figure 13: Parallel coordinates of all 25 trials\. High\-reward lines cluster consistently at a buffer size of $100{,}000$\.

These results confirm that the DQN generator operated under a carefully validated configuration, and that the benchmark results in Section 4 reflect learned policy behavior rather than favorable default settings\.

### A\.2 Randomization Test of the DQN Generator

To verify that the DQN generator does not exhibit mode collapse, the tendency of a trained policy to repeatedly produce structurally identical trajectories, a randomization test was performed over 20 independent episodes on the ToMStoryEnv environment \[[31](https://arxiv.org/html/2605.20423#bib.bib30)\]\. Three dimensions were evaluated: action\-space coverage, story uniqueness, and character diversity\.

All 15 discrete DSL actions were used across 300 timesteps \(100% coverage\), with no single action dominating the distribution\. Notably, high\-complexity operators such as double\_bluff \(19 times\) and one\_way\_mirror\_observation \(24 times\) were selected at rates comparable to simpler physical actions, confirming that the surrogate reward effectively encouraged cognitively complex behavior\. Story lengths ranged from 5 to 16 events \(mean: 9\.65, $\sigma=2.26$\) and all 20 generated stories were structurally unique with zero duplicates\.

![Refer to caption](https://arxiv.org/html/2605.20423v1/images/action_distribution.png) Figure 14: Action distribution across 300 timesteps\. All 15 DSL operators are represented, confirming full action\-space coverage\.
![Refer to caption](https://arxiv.org/html/2605.20423v1/images/story_diversity.png) Figure 15: Story diversity metrics, confirming 100% story uniqueness and consistent structural variation across episodes\.

Character diversity reached 89\.7% \(52 unique characters from 58 sampled\), with no single character appearing in more than 3 of the 20 episodes\. The randomization test returned a formal verdict of PASS, confirming that the OSCT dataset was produced by a generative policy rather than a collapsed one\. This is a necessary precondition for the curriculum fine\-tuning in Section 3\.5 to yield a model capable of generalizing across the full range of Observer\-Self Conflict configurations\.
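The three diversity measures used in the randomization test can be expressed as simple ratios. The sketch below is an assumed formulation: the function names, the representation of an episode as a sequence of action names, and the toy episodes are hypothetical, and the thresholds behind the PASS verdict are not specified in the text, so none are encoded here.

```python
def action_coverage(episodes, num_actions):
    """Fraction of the discrete action space used at least once."""
    used = {action for episode in episodes for action in episode}
    return len(used) / num_actions


def story_uniqueness(episodes):
    """Fraction of episodes whose action sequence is not a duplicate."""
    return len({tuple(episode) for episode in episodes}) / len(episodes)


def character_diversity(casts):
    """Unique characters divided by the total number of characters sampled."""
    sampled = [name for cast in casts for name in cast]
    return len(set(sampled)) / len(sampled)


# Hypothetical toy episodes over a four-action space.
toy_episodes = [
    ["move", "observe", "double_bluff"],
    ["move", "tell", "observe"],
    ["move", "observe", "double_bluff"],  # duplicate of the first episode
]
coverage = action_coverage(toy_episodes, num_actions=4)
uniqueness = story_uniqueness(toy_episodes)
reported_character_diversity = 52 / 58  # figures from the text, about 89.7%
```

On the reported run these ratios are 15/15 for action coverage, 20/20 for story uniqueness, and 52/58 for character diversity.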
