Mind the Perspective: Let's Reason Recursively for Theory of Mind
Summary
Introducing RecToM, an inference-time framework that models nested beliefs via recursive perspective construction for Theory of Mind reasoning in LLMs, achieving state-of-the-art performance on multiple benchmarks.
# Mind the Perspective: Let’s Reason Recursively for Theory of Mind
Source: [https://arxiv.org/html/2606.11724](https://arxiv.org/html/2606.11724)
Chao Lei1, Guang Hu1, Meng Yang2, Yanbei Jiang1, Nir Lipovetzky1 1School of Computing and Information Systems, The University of Melbourne, Australia 2SensiLab, Monash University, Australia \{clei1,ghu1,yanbeij\}@student\.unimelb\.edu\.au Meng\.Yang@monash\.edu, nir\.lipovetzky@unimelb\.edu\.au
###### Abstract
Theory of Mind (ToM) reasoning requires inferring agents’ beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM’s perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, on Hi-ToM, a benchmark requiring higher-order ToM reasoning, RecToM reaches 100% accuracy with GPT-5.4 and Qwen3.5.
## 1 Introduction
Theory of Mind (ToM), the ability to reason about others’ beliefs, knowledge, and perspectives, is a central component of social intelligence (Premack and Woodruff, [1978](https://arxiv.org/html/2606.11724#bib.bib1); Wimmer and Perner, [1983](https://arxiv.org/html/2606.11724#bib.bib32); Baron-Cohen et al., [1985](https://arxiv.org/html/2606.11724#bib.bib33)). For Large Language Models (LLMs) in interactive settings, ToM is central to handling asymmetric information, multi-agent coordination, and belief-dependent decision making (Rabinowitz et al., [2018](https://arxiv.org/html/2606.11724#bib.bib27); Sap et al., [2022](https://arxiv.org/html/2606.11724#bib.bib28); Gandhi et al., [2023](https://arxiv.org/html/2606.11724#bib.bib29)). However, recent studies show that even strong LLMs remain unreliable on ToM tasks that require reconstructing agent-specific beliefs from partial observations, rather than predicting the final world state (Sap et al., [2022](https://arxiv.org/html/2606.11724#bib.bib28); Gandhi et al., [2023](https://arxiv.org/html/2606.11724#bib.bib29); Wu et al., [2023](https://arxiv.org/html/2606.11724#bib.bib7); Kim et al., [2023](https://arxiv.org/html/2606.11724#bib.bib6)).
For instance, in a Sally-Anne-style false-belief paradigm, if Alice leaves after placing an object in a box and Bob later moves it to a drawer, Alice would believe the object remains in the box, whereas predicting the drawer reflects an omniscient-state bias (Wimmer and Perner, [1983](https://arxiv.org/html/2606.11724#bib.bib32); Baron-Cohen et al., [1985](https://arxiv.org/html/2606.11724#bib.bib33)). Such cases illustrate the epistemic nature of ToM reasoning, where successful reasoning must identify agent-specific observability, preserve beliefs across unobserved intervals, and revise beliefs only under observed relevant evidence. Higher-order questions, such as the second-order question “Where does Alice think Bob will search?”, further challenge ToM reasoning since they require nested belief construction, in which Bob’s belief must be represented within Alice’s perspective rather than inferred as Bob’s actual belief (Perner and Wimmer, [1985](https://arxiv.org/html/2606.11724#bib.bib31); Wu et al., [2023](https://arxiv.org/html/2606.11724#bib.bib7)).
Recent prompting-based methods address parts of this challenge through structured intermediate reasoning. SimToM uses a two-stage prompting procedure: it first filters the story to the events observable to each character in question, and then prompts the model to answer the ToM question using only the filtered context (Wilf et al., [2024](https://arxiv.org/html/2606.11724#bib.bib15)). TimeToM (Hou et al., [2024](https://arxiv.org/html/2606.11724#bib.bib17)) introduces a temporal space by assigning time points to story sentences or dialogue utterances, and constructs a Temporal Belief State Chain (TBSC) for each character. It further separates TBSC into self-world beliefs, which record a character’s belief, and social-world beliefs, which record beliefs about other characters’ actions that may create belief gaps. The former is used for first-order ToM questions, whereas the latter supports higher-order ToM reasoning. For higher-order questions, its Time-Aware Belief Solver identifies each character’s accessible time points, intersects them as belief-communication periods, and reduces higher-order questions to first-order questions within these periods. SimToM and TimeToM show that ToM reasoning benefits from intermediate representations that specify observable events for each character and the temporal evolution of character beliefs. However, neither SimToM nor TimeToM explicitly constructs nested beliefs for reasoning over one character’s belief within another character’s perspective in higher-order ToM questions. See Appendix [A](https://arxiv.org/html/2606.11724#A1) for detailed related work.
To address this limitation, we introduce RecToM, an inference-time framework that formulates ToM reasoning as recursive symbolic perspective construction for nested belief modeling. RecToM explicitly models belief states under partial observability, enabling beliefs to persist across unobserved intervals, update under observed evidence, and nest across character perspectives. For each ToM task, each narrative statement or dialogue utterance is abstracted into a fact-based symbolic event and classified as either persistent or transient. Persistent events introduce or revise ontic facts, such as object locations, character locations, and character presence, whereas transient events, such as communications, claims, and questions, allow belief updates under partial observability. RecToM constructs a global state-event sequence by accumulating ontic facts over the event sequence to build fact-based states and pairing each state with its corresponding persistent or transient event.
From the global state-event sequence, RecToM constructs perspectives recursively. The global perspective preserves the complete state-event sequence and serves as the initial source perspective. For each character specified in the belief question, RecToM constructs the character’s perspective in order by completing the character’s partial observation over the current source perspective. Observable states and events are retained locally, unobservable events are removed, and unobservable states are completed by inheriting the preceding belief state and revising it with the paired observable events. The newly constructed perspective then becomes the source perspective for the next character. In this way, nested beliefs are evaluated relative to preceding perspectives rather than the omniscient narrative, reducing higher-order belief questions to actual-world questions within the final constructed perspective, whose final state determines the answer under a closed-world assumption.
When evaluated on well-established ToM benchmarks, including Hi-ToM (Wu et al., [2023](https://arxiv.org/html/2606.11724#bib.bib7)), Big-ToM (Gandhi et al., [2023](https://arxiv.org/html/2606.11724#bib.bib29)), and FanToM (Kim et al., [2023](https://arxiv.org/html/2606.11724#bib.bib6)), RecToM consistently outperforms SimToM and TimeToM across multiple LLMs, demonstrating state-of-the-art performance. Its advantage is most evident on Hi-ToM, which includes up to fourth-order questions, where RecToM achieves 100% overall accuracy with two of the four evaluated LLMs and at least 98.50% with the other two. These results indicate that recursive perspective construction with explicit fact-based state representations yields robust gains in higher-order and information-asymmetric ToM reasoning. We outline our contributions as follows:
- We introduce RecToM, an inference-time framework that formulates ToM reasoning as recursive perspective construction, modeling nested beliefs with respect to preceding character perspectives.
- We provide a formal analysis showing that RecToM’s perspective construction induces a well-formed belief modality satisfying KD45.
- We conduct extensive experiments on Hi-ToM, Big-ToM, and FanToM across diverse LLM backbones, showing that RecToM consistently outperforms current state-of-the-art approaches.
## 2 Problem Formulation
We define a ToM instance as

$$\mathcal{I}=(E,q,y),\qquad E=(e_1,\ldots,e_T),$$

where $E$ is an ordered event sequence, each event $e_t\in E$ corresponds to a narrative statement or dialogue utterance, $q$ is the question, and $y$ is the ground-truth answer.
Let $\mathcal{A}(E)$ denote the characters appearing in $E$. For a question $q$, we define its character chain as

$$C(q)=(a_1,\ldots,a_K),\qquad a_i\in\mathcal{A}(E),$$

where $C(q)$ lists the characters in $q$ from the outermost belief holder $a_1$ to the innermost belief holder $a_K$, and $K$ denotes the belief-nesting order. Figure [1](https://arxiv.org/html/2606.11724#S2.F1) illustrates a Hi-ToM instance, where each $e_t$ is a narrative statement. The zero-order (actual-world) question has $K=0$ and $C(q)=\emptyset$; the first-order belief question has $K=1$ and $C(q)=(\mathrm{Elizabeth})$; and the third-order belief question, a higher-order case with $K>1$, has $K=3$ with $C(q)=(\mathrm{Elizabeth},\mathrm{Isabella},\mathrm{Jacob})$.
Figure 1: An example Hi-ToM instance with zero-order, first-order, and third-order belief questions over a narrative event sequence.

Figure 2: Illustration of the full RecToM procedure for solving the Hi-ToM instance in Figure [1](https://arxiv.org/html/2606.11724#S2.F1). Narrative events are abstracted into fact-based persistent events $\pi_i$ and transient events $\tau_i$, and accumulated into a global state-event sequence $\{(s_t,e_t)\}_{t=1}^{T}$, with $T=9$ in this example (Step 1). Character perspectives are recursively constructed by retaining observable states and events, completing unobserved states through belief persistence, and updating beliefs with observable events (Step 2). Questions of different orders are then reduced to zero-order questions and answered from the final state of the corresponding constructed perspective (Step 3).
## 3 RecToM Overview
RecToM consists of three main steps: 1) global state-event sequence construction; 2) recursive perspective generation for the characters in $C(q)$; and 3) answer inference from the constructed perspective. The overall procedure of RecToM is illustrated in Figure [2](https://arxiv.org/html/2606.11724#S2.F2).
### 3.1 Global State-Event Sequence Construction
#### 3.1.1 Fact-Based Event Abstraction
Given the event sequence $E$, RecToM first abstracts each event $e_t\in E$ into a fact-based symbolic representation and classifies it as either a persistent event $\pi_t$ or a transient event $\tau_t$. A persistent event $\pi_t$ introduces or revises ontic facts about world conditions, such as object locations, character locations, and character presence, which remain valid until explicitly revised by later persistent events. In contrast, a transient event $\tau_t$ records an event occurrence, such as a communication, claim, or question, that updates characters’ beliefs by modifying the facts when they are unobservable. We note that RecToM generates the fact-based representation using the structured abstraction prompt and determines $\pi_t$ and $\tau_t$ according to the task description. For concision, we use $e_t$ to denote the abstracted fact-based event in the following sections, unless explicitly noted.
#### 3.1.2 Global State-Event Sequence
RecToM constructs the symbolic state $s_t$ for each event $e_t$ by updating $s_{t-1}$ with added facts $\Delta_t^{+}$ and removed facts $\Delta_t^{-}$, initialized with $s_0=\emptyset$:

$$s_t=(s_{t-1}\setminus\Delta_t^{-})\cup\Delta_t^{+},\quad t=1,\ldots,T.\tag{1}$$

When $e_t=\pi_t$, $\Delta_t^{+}$ and $\Delta_t^{-}$ are specified by the persistent event; when $e_t=\tau_t$, $\Delta_t^{+}=\Delta_t^{-}=\emptyset$, yielding $s_t=s_{t-1}$. Thus, $s_t$ accumulates the ontic facts that hold after processing $e_t$, providing a complete state representation for belief modeling. RecToM preserves each abstract event $e_t$ at its original position, resulting in the global state-event sequence:

$$G=\{(s_t,e_t)\}_{t=1}^{T},$$

where each pair $(s_t,e_t)$ aligns the symbolic state with its corresponding event, instantiated as either $\pi_t$ or $\tau_t$. Step 1 in Figure [2](https://arxiv.org/html/2606.11724#S2.F2) illustrates the construction of $G$ for the Hi-ToM example in Figure [1](https://arxiv.org/html/2606.11724#S2.F1).
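The accumulation in Eq. (1) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Event` class, its field names, and the string encoding of facts are assumptions made for the example, and the toy story is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Abstracted fact-based event (class and field names are illustrative)."""
    label: str
    persistent: bool                    # True for pi_t, False for tau_t
    added: frozenset = frozenset()      # Delta_t^+
    removed: frozenset = frozenset()    # Delta_t^-


def build_global_sequence(events):
    """Return G = [(s_1, e_1), ..., (s_T, e_T)], starting from s_0 = {}."""
    state = frozenset()
    sequence = []
    for event in events:
        if event.persistent:
            # Eq. (1): s_t = (s_{t-1} \ Delta_t^-) | Delta_t^+
            state = (state - event.removed) | event.added
        # Transient events leave the ontic state unchanged (s_t = s_{t-1}).
        sequence.append((state, event))
    return sequence


# Hypothetical toy story, not taken from Hi-ToM.
story = [
    Event("Alice enters", True, added=frozenset({"in_room(alice)"})),
    Event("Apple is in the box", True, added=frozenset({"at(apple, box)"})),
    Event("Alice tells Bob something", False),
    Event("Apple moved to drawer", True,
          added=frozenset({"at(apple, drawer)"}),
          removed=frozenset({"at(apple, box)"})),
]
G = build_global_sequence(story)
final_state = G[-1][0]
```

Representing each state as an immutable `frozenset` keeps every $s_t$ in the sequence independent of later updates, which matters because later steps read earlier states.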
### 3.2 Perspective Construction
#### 3.2.1 Global Perspective
RecToM treats the global state-event sequence $G$ as the global perspective $P^{(*|G)}$, since $G$ contains complete state-event information independent of any character’s belief or observation. This allows RecToM to answer zero-order questions about real-world conditions, which are encoded in the final state of $P^{(*|G)}$, without character-specific belief inference.
#### 3.2.2 Character Perspective
A character perspective encodes the character’s belief across the states, combining what the character has observed with what the character continues to believe for unobserved states. Given a source perspective $P^{(\cdot)}=\{(s_t,e_t)\}_{t=1}^{T}$, RecToM constructs the perspective of character $a_i$, denoted by $P^{(a_i|P^{(\cdot)})}$, in two steps.
##### Character Partial Observation
To model what character $a_i$ has observed, RecToM derives a partial observation sequence $O^{(a_i|P^{(\cdot)})}=\{(\bar{s}_t,\bar{e}_t)\}_{t=1}^{T}$ over $P^{(\cdot)}$ for $a_i$, where $\bar{s}_t=s_t$ if $s_t$ is observable to $a_i$ and $\bar{s}_t=\emptyset$ otherwise, and $\bar{e}_t=e_t$ if $e_t$ is observable to $a_i$ and $\bar{e}_t=\emptyset$ otherwise. Thus, observable states and events in $P^{(\cdot)}$ are retained, unobservable events are discarded, and unobservable states are masked for later completion. RecToM evaluates each pair $(s_t,e_t)$ in $P^{(\cdot)}$ according to the task-specified observability rules. For example, in Hi-ToM, a state $s_t$ is observable to $a_i$ when it contains presence facts for $a_i$, such as `in_room`$(a_i)$; a persistent event $e_t=\pi_t$ is observable if its paired state $s_t$ is observable, while room-entry and room-exit events, such as `+in_room(character)` or `-in_room(character)`, are observable to all characters; and a transient event $e_t=\tau_t$ is observable when $a_i$ is a participant in the event, such as being the speaker or listener in `private_tell(speaker, listener, fact)`.
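The Hi-ToM observability rules just described can be written as a small masking function. The sketch below is illustrative only: in the paper the observability judgment is made by the LLM, whereas here it is hard-coded. The dictionary layout of events, the helper `persistent`, and the use of `None` for the masked value $\emptyset$ are all assumptions of this example.

```python
def state_observable(state, agent):
    """A state is visible to `agent` if it records the agent's presence."""
    return f"in_room({agent})" in state


def event_observable(state, event, agent):
    """Apply the Hi-ToM-style rules to one (state, event) pair."""
    if event["kind"] == "transient":
        # Transient events are visible only to their participants.
        return agent in event["participants"]
    changed = event["added"] | event["removed"]
    if any(fact.startswith("in_room(") for fact in changed):
        # Room entries and exits are visible to every character.
        return True
    return state_observable(state, agent)


def partial_observation(source, agent):
    """Mask a source perspective [(s_t, e_t), ...]; None stands in for the empty mask."""
    observed = []
    for state, event in source:
        s_bar = state if state_observable(state, agent) else None
        visible = event is not None and event_observable(state, event, agent)
        observed.append((s_bar, event if visible else None))
    return observed


def persistent(added=(), removed=()):
    """Hypothetical helper that builds a persistent event."""
    return {"kind": "persistent", "added": frozenset(added), "removed": frozenset(removed)}


# Hypothetical source perspective: Alice enters, sees the apple, leaves, apple is moved.
source = [
    (frozenset({"in_room(alice)"}), persistent(added=["in_room(alice)"])),
    (frozenset({"in_room(alice)", "at(apple, box)"}), persistent(added=["at(apple, box)"])),
    (frozenset({"at(apple, box)"}), persistent(removed=["in_room(alice)"])),
    (frozenset({"at(apple, drawer)"}),
     persistent(added=["at(apple, drawer)"], removed=["at(apple, box)"])),
]
alice_view = partial_observation(source, "alice")
```

In `alice_view`, the first two pairs are kept unchanged, the exit step keeps its event but masks its state, and the final move is masked entirely.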
##### Partial Observation Completion
To model what character $a_i$ continues to believe for unobserved states, RecToM completes the partial observation sequence of $a_i$. In detail, RecToM maps the partial observation sequence $O^{(a_i|P^{(\cdot)})}=\{(\bar{s}_t,\bar{e}_t)\}_{t=1}^{T}$ into the perspective of $a_i$:

$$P^{(a_i|P^{(\cdot)})}=\{(\hat{s}_t,\bar{e}_t)\}_{t=1}^{T},$$

where the event component $\bar{e}_t$ is preserved and the state component $\bar{s}_t$ is completed as $\hat{s}_t$:

$$\hat{s}_t=\begin{cases}\bar{s}_t,&\bar{s}_t\neq\emptyset,\\ f_{\mathrm{com}}(\hat{s}_{t-1},\bar{e}_t),&\bar{s}_t=\emptyset.\end{cases}\tag{2}$$

Here, the completion function $f_{\mathrm{com}}$ is applied when the state is unobservable, i.e., $\bar{s}_t=\emptyset$. It inherits the preceding completed state as the basis of the character’s belief to preserve belief continuity, and revises the inherited state with $\bar{e}_t$ when it is observable, i.e., $\bar{e}_t\neq\emptyset$, to account for belief updates. This follows the belief-persistence assumption that a character’s belief remains unchanged unless revised by observable evidence, consistent with prior work (Hu et al., [2023](https://arxiv.org/html/2606.11724#bib.bib34); Goldman and Pappas, [1979](https://arxiv.org/html/2606.11724#bib.bib35)). In practice, RecToM updates $\hat{s}_{t-1}$ with $\bar{e}_t$ using Eq. ([1](https://arxiv.org/html/2606.11724#S3.E1)) when $\bar{e}_t=\pi_t$, and revises $\hat{s}_{t-1}$ according to the task-specified transient-event rules when $\bar{e}_t=\tau_t$.
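A minimal sketch of the completion step in Eq. (2) follows. It rests on several assumptions: masked entries are encoded as `None`, the initial belief state $\hat{s}_0$ is taken to be empty (the text fixes only $s_0=\emptyset$ for the global sequence), and the transient-event rule is passed in as a hypothetical callback `revise_transient`, since in the paper that revision is task-specific and handled by the LLM.

```python
def complete(observation, revise_transient=lambda state, event: state):
    """Turn [(s_bar_t, e_bar_t), ...] into [(s_hat_t, e_bar_t), ...] following Eq. (2)."""
    completed = []
    previous = frozenset()  # assumed initial belief state, mirroring s_0 = {} in Eq. (1)
    for s_bar, e_bar in observation:
        if s_bar is not None:
            s_hat = s_bar                      # observable state: keep it
        elif e_bar is None:
            s_hat = previous                   # nothing observed: belief persists
        elif e_bar["kind"] == "persistent":
            # Observed persistent event: deterministic update of Eq. (1).
            s_hat = (previous - e_bar["removed"]) | e_bar["added"]
        else:
            # Observed transient event: task-specific rule supplied by the caller.
            s_hat = revise_transient(previous, e_bar)
        completed.append((s_hat, e_bar))
        previous = s_hat
    return completed


# Hypothetical masked observation for Alice in a Sally-Anne-style story.
put_apple = {"kind": "persistent", "added": frozenset({"at(apple, box)"}), "removed": frozenset()}
exit_alice = {"kind": "persistent", "added": frozenset(), "removed": frozenset({"in_room(alice)"})}
alice_observation = [
    (frozenset({"in_room(alice)", "at(apple, box)"}), put_apple),
    (None, exit_alice),   # Alice leaves: state masked, exit event still visible
    (None, None),         # apple moved to the drawer, unseen by Alice
]
alice_perspective = complete(alice_observation)
alice_final_belief = alice_perspective[-1][0]
```

Here `alice_final_belief` still places the apple in the box, which is the false belief the Sally-Anne paradigm is designed to elicit.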
##### Recursive Perspective Generation
For a belief question of order $K>0$, the global perspective $P^{(*|G)}$ serves as the source perspective for recursively constructing character perspectives along the character chain $C(q)$. Each constructed perspective then serves as the source for the next character, thereby modeling nested beliefs relative to the preceding perspective, as shown in Step 2 of Figure [2](https://arxiv.org/html/2606.11724#S2.F2).
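Structurally, this recursion is a left fold of the character chain over the global perspective. The sketch below shows only that control flow. The function names are hypothetical, and `label_only` is a stand-in that records the nesting as a string rather than building real perspectives.

```python
from functools import reduce


def final_perspective(global_perspective, chain, build_perspective):
    """Fold the character chain (outermost belief holder first) over the global perspective.

    `build_perspective(source, agent)` is assumed to perform partial observation
    followed by completion and to return the agent's perspective.
    """
    return reduce(build_perspective, chain, global_perspective)


def label_only(source, agent):
    """Stand-in builder that only records which source each perspective came from."""
    return f"P({agent}|{source})"


# Character chain of the third-order question in Figure 1.
nested = final_perspective("G", ["Elizabeth", "Isabella", "Jacob"], label_only)
print(nested)  # P(Jacob|P(Isabella|P(Elizabeth|G)))
```

An empty chain returns the global perspective unchanged, which corresponds to the zero-order case.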
### 3.3 Answer Inference
Recursive perspective construction reduces first-order and higher-order questions to zero-order questions within the final constructed perspective. Under the closed-world assumption, RecToM answers the reduced question using the final state of this perspective, which encodes the innermost character’s belief conditioned on the preceding perspectives in $C(q)$, as illustrated in Step 3 of Figure [2](https://arxiv.org/html/2606.11724#S2.F2).
Algorithm 1: Pseudocode of RecToM

Input: Event sequence $E=(e_1,\ldots,e_T)$; ToM question $q$; LLM $\mathcal{M}$

Output: Predicted answer $\hat{y}$

1. $\mathcal{F}=\{(\Delta_t^{+},\Delta_t^{-},e_t),\ e_t\in\{\pi_t,\tau_t\}\}_{t=1}^{T}\leftarrow\mathcal{M}(E)$ // Global perspective construction: fact-based event abstraction and event-type classification
2. $P^{(*|G)}\leftarrow\mathrm{BuildGlobalPerspective}(\mathcal{F})$ // Construct the global perspective
3. $C(q)=(a_1,\ldots,a_K)\leftarrow\mathcal{M}(q)$ // Recursive perspective construction: extract the character chain; $C(q)=\emptyset$ indicates a zero-order question
4. **if** $C(q)=\emptyset$ **then** (Lines 5-6)
5. $\hat{y}\leftarrow\mathcal{M}(q,\mathrm{FinalState}(P^{(*|G)}))$ // Answer from the global perspective
6. **return** $\hat{y}$
7. $P^{(\cdot)}\leftarrow P^{(*|G)}$ // Initialize source perspective
8. **foreach** $a_i\in C(q)$ **do** (Lines 9-11)
9. $O^{(a_i|P^{(\cdot)})}\leftarrow\mathcal{M}(a_i,P^{(\cdot)})$ // Build partial observation sequence
10. $P^{(a_i|P^{(\cdot)})}\leftarrow\mathrm{Complete}(O^{(a_i|P^{(\cdot)})},\mathcal{M})$ // Complete partial observation
11. $P^{(\cdot)}\leftarrow P^{(a_i|P^{(\cdot)})}$ // Update source perspective
12. $q^{(0)}\leftarrow\mathcal{M}(q,C(q))$ // Answer inference: reduce to a zero-order question
13. $\hat{y}\leftarrow\mathcal{M}(q^{(0)},\mathrm{FinalState}(P^{(\cdot)}))$ // Answer from the final perspective
14. **return** $\hat{y}$
### 3.4 Pseudocode of RecToM
Algorithm [1](https://arxiv.org/html/2606.11724#algorithm1) summarizes the full procedure of RecToM. The LLM $\mathcal{M}$ is used for natural-language interpretation and semantic reasoning, including event abstraction and event-type classification (Line 1), character-chain extraction (Line 3), observability judgment (Line 9), transient-event belief revision within $\mathrm{Complete}(\cdot)$ (Line 10), question reduction (Line 12), and answer inference (Lines 5 and 13). In contrast, global state accumulation from persistent events during global-perspective construction (Line 2) and persistent-event updates inside $\mathrm{Complete}(\cdot)$ (Line 10) follow the deterministic update rule in Eq. ([1](https://arxiv.org/html/2606.11724#S3.E1)). RecToM combines the semantic reasoning capacity of LLMs with deterministic updates to improve the reliability of belief persistence, belief revision, and recursive perspective construction in ToM reasoning.
## 4 KD45 Analysis of RecToM
We use KD45, a standard modal logic for belief, to show that RecToM’s perspective construction induces a well-formed belief modality beyond simple event filtering (Malcolm, [1952](https://arxiv.org/html/2606.11724#bib.bib36); Fagin et al., [2004](https://arxiv.org/html/2606.11724#bib.bib37)). In this analysis, axiom K requires a constructed perspective to support standard logical inference, axiom D requires internal consistency, and axioms 4 and 5 concern the preservation of beliefs and non-beliefs under repeated perspective construction for the same character. For concision, we provide a proof sketch here and show the full proof in Appendix [B](https://arxiv.org/html/2606.11724#A2).
Let $\varphi$ and $\psi$ denote belief-query formulas, which may be symbolic facts or nested belief statements. For a character $a_i$, let $B_{a_i}\varphi$ denote that $a_i$ believes $\varphi$. RecToM interprets $B_{a_i}\varphi$ by constructing $a_i$’s perspective, $P^{(a_i\mid P^{(\cdot)})}$, from a source perspective $P^{(\cdot)}$, and evaluating $\varphi$ inside $P^{(a_i\mid P^{(\cdot)})}$:

$$P^{(\cdot)}\models B_{a_i}\varphi\quad\text{iff}\quad P^{(a_i\mid P^{(\cdot)})}\models\varphi.\tag{3}$$

Thus, $\varphi$ is evaluated in the constructed character perspective rather than in the source perspective.
### 4.1 Self-Perspective Idempotence
The key property for proving KD45 is that the perspective is stable under repeated construction for the same character:

$$P^{(a_i|P^{(a_i|P^{(\cdot)})})}=P^{(a_i|P^{(\cdot)})}.\tag{4}$$

RecToM satisfies this property because reconstructing the partial observation sequence of $a_i$ from $P^{(a_i|P^{(\cdot)})}$ yields the same partial observation sequence, so the subsequent completion remains the same and produces the same perspective.
### 4.2 KD45 Satisfaction
For RecToM, K holds because the constructed perspective supports standard logical inference (if $\varphi\rightarrow\psi$ and $\varphi$ hold, then $\psi$ holds). According to Eq. ([3](https://arxiv.org/html/2606.11724#S4.E3)), $B_{a_i}(\varphi\rightarrow\psi)$ and $B_{a_i}\varphi$ holding in $P^{(\cdot)}$ implies that $\varphi\rightarrow\psi$ and $\varphi$ hold in $P^{(a_i|P^{(\cdot)})}$; hence $\psi$ holds in $P^{(a_i|P^{(\cdot)})}$, and therefore $B_{a_i}\psi$ holds in $P^{(\cdot)}$. D holds because each constructed perspective maintains a consistent symbolic state under the closed-world assumption, so a fact and its negation cannot both hold. 4 and 5 follow from Eq. ([4](https://arxiv.org/html/2606.11724#S4.E4)): reconstructing the same character’s perspective returns the same perspective, so both beliefs and non-beliefs are preserved under repeated construction. Therefore, the belief modality induced by RecToM’s perspective construction satisfies KD45.
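One way to unpack the step from idempotence to axioms 4 and 5 is sketched below. This is a reading of the proof sketch, not the full proof in Appendix B, which may proceed differently. It assumes two-valued satisfaction, i.e. $P\models\neg\chi$ iff $P\not\models\chi$, which the closed-world assumption suggests but which is taken here as an assumption.

```latex
% Axiom 4 (B\varphi \to BB\varphi): every step is an equivalence.
\begin{align*}
P^{(\cdot)} \models B_{a_i}\varphi
  &\iff P^{(a_i \mid P^{(\cdot)})} \models \varphi
  && \text{Eq. (3)} \\
  &\iff P^{(a_i \mid P^{(a_i \mid P^{(\cdot)})})} \models \varphi
  && \text{Eq. (4)} \\
  &\iff P^{(a_i \mid P^{(\cdot)})} \models B_{a_i}\varphi
  && \text{Eq. (3) with source } P^{(a_i \mid P^{(\cdot)})} \\
  &\iff P^{(\cdot)} \models B_{a_i} B_{a_i}\varphi
  && \text{Eq. (3) with source } P^{(\cdot)}
\end{align*}

% Axiom 5 (\neg B\varphi \to B\neg B\varphi): negate the first three steps above.
\begin{align*}
P^{(\cdot)} \models \neg B_{a_i}\varphi
  &\iff P^{(a_i \mid P^{(\cdot)})} \not\models B_{a_i}\varphi
  && \text{steps 1--3 above, negated} \\
  &\iff P^{(a_i \mid P^{(\cdot)})} \models \neg B_{a_i}\varphi
  && \text{two-valued satisfaction (assumed)} \\
  &\iff P^{(\cdot)} \models B_{a_i} \neg B_{a_i}\varphi
  && \text{Eq. (3) with source } P^{(\cdot)}
\end{align*}
```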
| Model | Method | 0th-Order | 1st-Order | 2nd-Order | 3rd-Order | 4th-Order | Overall |
|---|---|---|---|---|---|---|---|
| GPT-5.4 | CoT | 100.00 | 88.75 | 60.00 | 68.75 | 71.25 | 77.75 |
| | SimToM | 100.00 | 92.50 | 88.75 | 87.50 | 80.00 | 89.75 |
| | TimeToM | 100.00 | 92.50 | 90.00 | 87.50 | 81.25 | 90.25 |
| | RecToM (ours) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Gemini-3 | CoT | 100.00 | 100.00 | 92.50 | 85.00 | 83.75 | 92.25 |
| | SimToM | 100.00 | 91.25 | 90.00 | 88.75 | 86.25 | 91.25 |
| | TimeToM | 100.00 | 78.75 | 86.25 | 86.25 | 87.50 | 87.75 |
| | RecToM (ours) | 100.00 | 100.00 | 97.50 | 97.50 | 97.50 | 98.50 |
| Qwen3.5 | CoT | 100.00 | 95.00 | 88.75 | 83.75 | 85.00 | 90.50 |
| | SimToM | 100.00 | 87.50 | 90.00 | 90.00 | 86.25 | 90.75 |
| | TimeToM | 100.00 | 70.00 | 87.50 | 86.25 | 82.50 | 85.25 |
| | RecToM (ours) | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 |
| Gemma-4 | CoT | 100.00 | 93.75 | 65.00 | 62.50 | 58.75 | 76.00 |
| | SimToM | 98.75 | 91.25 | 81.25 | 73.75 | 73.75 | 83.75 |
| | TimeToM | 97.50 | 70.00 | 60.00 | 63.75 | 60.00 | 70.25 |
| | RecToM (ours) | 100.00 | 100.00 | 100.00 | 100.00 | 97.50 | 99.50 |

Table 1: Accuracy (%) on Hi-ToM across question orders and overall performance. Each order contains 80 instances, with 400 tasks in total. The best result for each backbone and metric is in bold. The second-best result is underlined.
## 5 Experiment
We first evaluate RecToM on Hi-ToM, a benchmark designed to assess higher-order ToM reasoning over narrative event sequences (Figure [1](https://arxiv.org/html/2606.11724#S2.F1)). We select 400 tasks from Hi-ToM, covering question orders from zero to four, with 80 instances per order. Each task requires selecting the correct answer from 15 candidate choices. We compare RecToM against current state-of-the-art approaches: Chain-of-Thought prompting (CoT), SimToM, and TimeToM. To examine robustness across model families and scales, we evaluate all methods with multiple LLM backbones: proprietary models GPT-5.4 (OpenAI, [2026](https://arxiv.org/html/2606.11724#bib.bib38)) and Gemini-3-Flash (DeepMind, [2025](https://arxiv.org/html/2606.11724#bib.bib39)), the open-source dense model Qwen3.5-27B (Team, [2026](https://arxiv.org/html/2606.11724#bib.bib40)), and the open-source Mixture-of-Experts model Gemma-4-26B-A4B (DeepMind, [2026](https://arxiv.org/html/2606.11724#bib.bib41)). See Appendix [C](https://arxiv.org/html/2606.11724#A3) for parameter settings.
| Model | Method | Big-ToM Overall | FanToM 1st-Order | FanToM 2nd-Order | FanToM Overall |
|---|---|---|---|---|---|
| GPT-5.4 | CoT | 99.00 | 88.95 | 86.71 | 87.96 |
| | SimToM | 99.00 | 91.16 | 95.80 | 93.21 |
| | TimeToM | 97.75 | 92.27 | 92.31 | 92.28 |
| | RecToM (ours) | 99.50 | 96.13 | 99.30 | 97.53 |
| Gemini-3 | CoT | 98.75 | 77.35 | 90.21 | 83.02 |
| | SimToM | 98.00 | 82.32 | 89.51 | 85.49 |
| | TimeToM | 92.50 | 74.59 | 91.61 | 82.10 |
| | RecToM (ours) | 99.00 | 91.16 | 91.61 | 91.36 |
| Qwen3.5 | CoT | 96.25 | 78.45 | 85.31 | 81.48 |
| | SimToM | 94.75 | 89.50 | 81.82 | 86.11 |
| | TimeToM | 91.00 | 86.74 | 85.31 | 86.11 |
| | RecToM (ours) | 98.50 | 92.82 | 86.01 | 89.81 |
| Gemma-4 | CoT | 93.25 | 63.54 | 75.52 | 68.83 |
| | SimToM | 88.00 | 71.27 | 80.42 | 75.31 |
| | TimeToM | 80.00 | 76.24 | 79.72 | 77.78 |
| | RecToM (ours) | 99.00 | 87.29 | 84.62 | 86.11 |

Table 2: Accuracy (%) on Big-ToM and FanToM. Big-ToM contains 400 instances, while FanToM contains 181 first-order and 143 second-order instances, with 324 instances in total. The best result for each backbone and metric is in bold. The second-best result is underlined.

### 5.1 Results on Hi-ToM
Table [1](https://arxiv.org/html/2606.11724#S4.T1) reports Hi-ToM accuracy for zero-order to fourth-order questions, together with overall performance. RecToM achieves the highest overall accuracy across all LLM backbones, reaching 100.00% with GPT-5.4 and Qwen3.5, 98.50% with Gemini-3, and 99.50% with Gemma-4, demonstrating state-of-the-art performance. Compared with the strongest baseline for each backbone, RecToM yields absolute overall accuracy gains of 9.75, 6.25, 9.25, and 15.75 percentage points, respectively, with the largest improvement over SimToM using Gemma-4. The order-wise results show that all methods perform strongly on zero-order questions, whereas the baselines generally degrade as the order increases, reflecting the difficulty of inferring nested character beliefs under asymmetric information. In contrast, RecToM maintains near-perfect accuracy across question orders, with the lowest order-wise accuracy still reaching 97.50% on Gemini-3 and Gemma-4. The few remaining errors arise from LLM semantic interpretation failures, such as incorrect observability identification. These results indicate that recursive perspective construction provides a robust mechanism for higher-order ToM reasoning and demonstrate the model-independent nature of RecToM.
### 5.2 Generalization across ToM Scenarios
We further evaluate RecToM on two additional benchmarks, Big-ToM and FanToM, under the same experimental settings as Table [1](https://arxiv.org/html/2606.11724#S4.T1) to examine its generality across different ToM scenarios. Like Hi-ToM, Big-ToM follows the Sally–Anne paradigm, but it presents stories in more natural language and extends beyond object-location changes. Following prior work (Wilf et al., [2024](https://arxiv.org/html/2606.11724#bib.bib15); Hou et al., [2024](https://arxiv.org/html/2606.11724#bib.bib17)), we evaluate 400 forward-belief questions from Big-ToM, consisting of 200 false-belief and 200 true-belief first-order questions. FanToM evaluates ToM reasoning in interactive dialogue scenarios, where characters enter and leave ongoing conversations, creating asymmetric information access and distinct mental states (Quesque and Rossetti, [2020](https://arxiv.org/html/2606.11724#bib.bib42)). We evaluate 324 FanToM belief questions, including 181 first-order and 143 second-order questions. Detailed benchmark descriptions for Big-ToM and FanToM are provided in Appendix [D](https://arxiv.org/html/2606.11724#A4).
The results are reported in Table [2](https://arxiv.org/html/2606.11724#S5.T2). RecToM achieves the highest overall accuracy for every backbone on both datasets, reaching up to 99.50% on Big-ToM and 97.53% on FanToM, with the largest absolute gain of 8.33 percentage points over the second-best method on FanToM with Gemma-4. It also obtains the best or tied-best order-specific accuracy on FanToM. These results show that RecToM generalizes from narrative event sequences to interactive dialogue scenarios. For RecToM, errors in FanToM mainly reflect incorrect semantic grounding by LLMs. For example, partial exposure, where a character enters the conversation late and hears only the “tail end” of a conversation, is incorrectly treated as access to the full preceding dialogue. Moreover, when later utterances are semantically related to earlier ones, LLMs may incorrectly assume that a character who joined later also knows information mentioned before.

| Model | Method | Hi\-ToM Tok | Hi\-ToM Eff | Big\-ToM Tok | Big\-ToM Eff | FanToM Tok | FanToM Eff |
|---|---|---|---|---|---|---|---|
| GPT\-5\.4 | CoT | 0\.6K | – | 0\.3K | – | 0\.9K | – |
| GPT\-5\.4 | SimToM | 2\.3K | 7\.3 | 0\.5K | 0\.0 | 4\.0K | 1\.7 |
| GPT\-5\.4 | TimeToM | 3\.0K | 5\.4 | 1\.1K | \-1\.5 | 5\.4K | 1\.0 |
| GPT\-5\.4 | RecToM \(ours\) | 6\.6K | 3\.8 | 3\.4K | 0\.2 | 8\.7K | 1\.2 |
| Gemini\-3 | CoT | 1\.1K | – | 0\.4K | – | 1\.1K | – |
| Gemini\-3 | SimToM | 3\.7K | \-0\.4 | 1\.2K | \-1\.0 | 5\.7K | 0\.5 |
| Gemini\-3 | TimeToM | 7\.5K | \-0\.7 | 3\.0K | \-2\.4 | 9\.4K | \-0\.1 |
| Gemini\-3 | RecToM \(ours\) | 9\.4K | 0\.7 | 4\.8K | 0\.1 | 10\.2K | 0\.9 |

Table 3: Cost analysis on the proprietary backbones GPT\-5\.4 and Gemini\-3\. Tok denotes the average token usage per question\. K is $10^{3}$\. Eff denotes token efficiency relative to CoT, computed as $\Delta$Acc/$\Delta$Tok, where $\Delta$Acc and $\Delta$Tok are the overall accuracy gain and additional token usage over CoT, respectively\. Higher values indicate better token efficiency\. Negative values indicate that a method consumes more tokens than CoT while achieving lower overall accuracy\.
### 5\.3 Cost Analysis
Table [3](https://arxiv.org/html/2606.11724#S5.T3) compares the average token usage per question \(Tok\) and token efficiency \(Eff\) relative to CoT across the evaluated approaches, using GPT\-5\.4 and Gemini\-3 as backbones\. RecToM uses the most tokens of all methods on every benchmark\. However, when evaluated by token efficiency, measured as the overall accuracy gain obtained per additional 1K tokens relative to CoT, RecToM achieves the highest efficiency on all three benchmarks with Gemini\-3 and on Big\-ToM with GPT\-5\.4\. On FanToM with GPT\-5\.4, its efficiency is lower than that of SimToM but remains comparable \(1\.2 vs\. 1\.7\)\. RecToM is less token\-efficient than both SimToM and TimeToM on Hi\-ToM with GPT\-5\.4 \(3\.8 vs\. 7\.3 and 5\.4\), where higher\-order questions require deeper perspective construction\. However, this additional computation supports near\-perfect accuracy across all LLM backbones on this challenging benchmark, as reported in Table [1](https://arxiv.org/html/2606.11724#S4.T1)\.
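As a minimal sketch of how the Eff column can be reproduced, the snippet below computes the accuracy gain per additional 1K tokens over CoT for Big\-ToM with GPT\-5\.4\. The accuracies come from Table 5 and the token counts are the rounded values in Table 3, so the results may differ slightly from the reported figures, which were presumably computed from unrounded token counts\. The function name `token_efficiency` and the dictionary layout are illustrative assumptions, not part of the paper\.

```python
def token_efficiency(acc, tok_k, cot_acc, cot_tok_k):
    """Accuracy gain (percentage points) per additional 1K tokens over CoT."""
    delta_acc = acc - cot_acc          # overall accuracy gain over CoT
    delta_tok = tok_k - cot_tok_k      # additional tokens over CoT, in thousands
    if delta_tok == 0:
        raise ValueError("method uses the same number of tokens as CoT")
    return delta_acc / delta_tok


# Big-ToM with GPT-5.4: accuracies from Table 5, rounded token usage from Table 3.
cot = {"acc": 99.00, "tok_k": 0.3}
methods = {
    "SimToM": {"acc": 99.00, "tok_k": 0.5},
    "TimeToM": {"acc": 97.75, "tok_k": 1.1},
    "RecToM": {"acc": 99.50, "tok_k": 3.4},
}

eff = {
    name: token_efficiency(m["acc"], m["tok_k"], cot["acc"], cot["tok_k"])
    for name, m in methods.items()
}

for name, value in eff.items():
    print(f"{name}: {value:+.2f} accuracy points per extra 1K tokens")
```

A negative value, as for TimeToM here, means the method spends more tokens than CoT while scoring lower, which matches the reading given in the caption of Table 3\.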
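
| Backbone | Method | Hi\-ToM Acc | Hi\-ToM $\Delta$ | Big\-ToM Acc | Big\-ToM $\Delta$ | FanToM Acc | FanToM $\Delta$ |
|---|---|---|---|---|---|---|---|
| GPT\-5\.4 | RecToM | 100\.00 | – | 99\.50 | – | 97\.53 | – |
| GPT\-5\.4 | w/o\-det | 100\.00 | 0\.00 | 99\.50 | 0\.00 | 97\.53 | 0\.00 |
| GPT\-5\.4 | w/o\-state | 93\.00 | 7\.00 | 99\.25 | 0\.25 | 94\.44 | 3\.09 |
| Gemini\-3 | RecToM | 98\.50 | – | 99\.00 | – | 91\.36 | – |
| Gemini\-3 | w/o\-det | 98\.00 | 0\.50 | 99\.00 | 0\.00 | 90\.43 | 0\.93 |
| Gemini\-3 | w/o\-state | 94\.25 | 4\.25 | 98\.75 | 0\.25 | 87\.65 | 3\.70 |
| Qwen3\.5 | RecToM | 100\.00 | – | 98\.50 | – | 89\.81 | – |
| Qwen3\.5 | w/o\-det | 100\.00 | 0\.00 | 98\.50 | 0\.00 | 89\.50 | 0\.31 |
| Qwen3\.5 | w/o\-state | 93\.50 | 6\.50 | 97\.25 | 1\.25 | 87\.35 | 2\.47 |
| Gemma\-4 | RecToM | 99\.50 | – | 99\.00 | – | 86\.11 | – |
| Gemma\-4 | w/o\-det | 98\.00 | 1\.50 | 98\.75 | 0\.25 | 85\.19 | 0\.93 |
| Gemma\-4 | w/o\-state | 88\.25 | 11\.25 | 95\.25 | 3\.75 | 80\.86 | 5\.25 |

Table 4: Ablation study of RecToM\. w/o\-det replaces deterministic state updates with LLM\-based state updating\. w/o\-state removes symbolic state construction and maintains cumulative observable event sequences in each perspective\. Acc denotes overall accuracy \(%\)\. $\Delta$ denotes the accuracy decrease relative to RecToM\.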
### 5\.4 Ablation Study
To examine the contribution of deterministic state updates and symbolic state representation, we compare RecToM with two ablated variants\. The w/o\-det variant retains symbolic states, while replacing deterministic state updates with LLM\-based execution under the same update rule in Eq\. [1](https://arxiv.org/html/2606.11724#S3.E1)\. The w/o\-state variant removes state representation from each perspective while preserving recursive perspective construction along $C(q)$\. Instead of constructing perspectives through observation\-based completion over state\-event pairs, it maintains a cumulative observable event history at each step, $\bar{E}_{t}=\{\bar{e}_{1},\ldots,\bar{e}_{t}\}$, yielding the perspective $\tilde{P}=\{\bar{E}_{t}\}_{t=1}^{T}$\.
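To make the w/o\-state representation concrete, the following sketch builds the cumulative histories $\bar{E}_{1},\ldots,\bar{E}_{T}$ from a sequence of already\-filtered observable events\. It is only an illustration under stated assumptions: events are plain strings, an unobservable step is encoded as `None` and contributes nothing to the history, and the function name `cumulative_histories` and the example story are hypothetical rather than taken from the paper\.

```python
from typing import List, Optional


def cumulative_histories(observable_events: List[Optional[str]]) -> List[List[str]]:
    """Return [E_1, ..., E_T], where E_t lists the events observed up to step t.

    Assumption: None marks a step the character did not observe, so the
    history at that step is identical to the history at the previous step.
    """
    histories: List[List[str]] = []
    seen: List[str] = []
    for event in observable_events:
        if event is not None:
            seen.append(event)
        histories.append(list(seen))  # copy so later steps do not mutate E_t
    return histories


# Hypothetical story from one character's point of view; step 3 is unobserved.
events = [
    "Anne enters the room",
    "Anne puts the ball in the basket",
    None,
    "Anne leaves the room",
]
histories = cumulative_histories(events)
for t, history in enumerate(histories, start=1):
    print(t, history)
```

Note that nothing in this representation records what is true after each step; the LLM has to infer that from the raw event text, which is the burden the ablation is designed to expose\.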
Table [4](https://arxiv.org/html/2606.11724#S5.T4) reports the ablation results\. The w/o\-det variant leads to slight performance degradation \(at most 1\.50 points, with no change on GPT\-5\.4\), suggesting that deterministic updates reduce variability in belief\-state maintenance\. Removing symbolic state representation \(the w/o\-state variant\) consistently reduces performance across all benchmarks and backbones, with the largest degradation of 11\.25 points recorded on Hi\-ToM using Gemma\-4\. These results indicate that fact\-based state representations provide a more reliable basis for belief reasoning than cumulative observable event histories\. The w/o\-state variant requires the LLM to implicitly infer observability from accumulated event semantics, as well as belief persistence and revision across observed and unobserved events, which can lead to incorrect observability judgments, inconsistent beliefs, and erroneous answer inference\. In contrast, RecToM explicitly constructs fact\-based states after each event, encoding belief\-relevant conditions through ontic fact updates\. The fact\-based state representation supports accurate observability evaluation via explicit character presence and location facts and enables answer inference directly from ontic facts in the constructed perspective\. These advantages are most evident in higher\-order reasoning on Hi\-ToM and FanToM, where errors can propagate through recursively constructed perspectives\.
## 6 Conclusion
We introduced RecToM, an inference\-time framework for ToM reasoning\. RecToM models nested beliefs by recursively constructing character perspectives, where each constructed perspective becomes the source for constructing the next character’s perspective\. This reduces higher\-order belief questions to actual\-world questions within the perspective of the innermost character specified by the question\. We further provided a KD45 analysis showing that RecToM’s perspective construction induces a well\-formed belief structure beyond simple event filtering\. Experiments on the Hi\-ToM, Big\-ToM, and FanToM benchmarks demonstrate that RecToM consistently outperforms recent advanced approaches across multiple LLM backbones, achieving state\-of\-the\-art performance with the strongest gains on higher\-order ToM questions\.
## 7 Limitations
RecToM is designed for controlled text\-based ToM benchmarks where event\-transition rules and observability assumptions are specified by the task description\. While this setting covers higher\-order beliefs, asymmetric information, and dialogue\-based belief reasoning, extending RecToM to open\-ended or multimodal environments may require additional grounding of implicit observations and event\-transition rules\. Moreover, RecToM explicitly constructs character perspectives before deriving the final answer, which introduces additional inference\-time computation\. Future work can reduce this cost through prompt compression, caching shared perspective states, or lightweight symbolic abstraction\.
## References
- A\. Baltag, L\. S\. Moss, and S\. Solecki \(1998\) The logic of public announcements, common knowledge, and private suspicions\. In Proceedings of the 7th Conference on Theoretical Aspects of Rationality and Knowledge, TARK ’98, San Francisco, CA, USA, pp\. 43–56\. External Links: ISBN 1558605630\.
- S\. Baron\-Cohen, A\. M\. Leslie, and U\. Frith \(1985\) Does the autistic child have a “theory of mind”? Cognition 21\(1\), pp\. 37–46\.
- B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini \(2024\) Large language monkeys: scaling inference compute with repeated sampling\. External Links: 2407\.21787, [Link](https://arxiv.org/abs/2407.21787)\.
- Z\. Chen, J\. Wu, J\. Zhou, B\. Wen, G\. Bi, G\. Jiang, Y\. Cao, M\. Hu, Y\. Lai, Z\. Xiong, and M\. Huang \(2024\) ToMBench: benchmarking theory of mind in large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Bangkok, Thailand, pp\. 15959–15983\. External Links: [Link](https://aclanthology.org/2024.acl-long.847/)\.
- Google DeepMind \(2025\) Gemini 3 developer guide\. Note: Accessed: 2026\-05\-21\. External Links: [Link](https://ai.google.dev/gemini-api/docs/gemini-3)\.
- Google DeepMind \(2026\) Gemma 4\. Note: Accessed: 2026\-05\-21\. External Links: [Link](https://deepmind.google/models/gemma/gemma-4/)\.
- R\. Fagin, J\. Y\. Halpern, Y\. Moses, and M\. Vardi \(2004\) Reasoning about knowledge\. MIT Press\.
- K\. Gandhi, J\. Fränken, T\. Gerstenberg, and N\. D\. Goodman \(2023\) Understanding social reasoning in language models with language models\. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NeurIPS, Red Hook, NY, USA\.
- A\. Goldman and G\. Pappas \(1979\) Justification and knowledge\. Reidel, chapter What is Justified Belief, pp\. 1–23\.
- G\. Hou, W\. Zhang, Y\. Shen, L\. Wu, and W\. Lu \(2024\) TimeToM: temporal space is the key to unlocking the door of large language models’ theory\-of\-mind\. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, pp\. 11532–11547\.
- G\. Hu, T\. Miller, and N\. Lipovetzky \(2023\) Planning with multi\-agent belief using justified perspectives\. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol\. 33, pp\. 180–188\.
- C\. Jung, D\. Kim, J\. Jin, J\. Kim, Y\. Seonwoo, Y\. Choi, A\. Oh, and H\. Kim \(2024\) Perceptions to beliefs: exploring precursory inferences for theory of mind in large language models\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, USA, pp\. 19794–19809\.
- H\. Kim, M\. Sclar, X\. Zhou, R\. Bras, G\. Kim, Y\. Choi, and M\. Sap \(2023\) FANToM: a benchmark for stress\-testing machine theory of mind in interactions\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp\. 14397–14413\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\) Large language models are zero\-shot reasoners\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\.
- M\. Le, Y\. Boureau, and M\. Nickel \(2019\) Revisiting the evaluation of theory of mind through question answering\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 5872–5877\.
- N\. Malcolm \(1952\) Knowledge and belief\. Mind 61\(242\), pp\. 178–189\.
- OpenAI \(2026\) Introducing gpt\-5\.4\. Note: Accessed: 2026\-05\-21\. External Links: [Link](https://openai.com/index/introducing-gpt-5-4/)\.
- J\. Perner and H\. Wimmer \(1985\) “John thinks that mary thinks that…” attribution of second\-order beliefs by 5\-to 10\-year\-old children\. Journal of experimental child psychology 39\(3\), pp\. 437–471\.
- D\. Premack and G\. Woodruff \(1978\) Does the chimpanzee have a theory of mind? Behavioral and Brain Sciences 1\(4\), pp\. 515–526\.
- F\. Quesque and Y\. Rossetti \(2020\) What do theory\-of\-mind tasks actually measure? theory and practice\. Perspectives on psychological science 15\(2\), pp\. 384–396\.
- N\. Rabinowitz, F\. Perbet, F\. Song, C\. Zhang, S\. A\. Eslami, and M\. Botvinick \(2018\) Machine theory of mind\. In International conference on machine learning, pp\. 4218–4227\.
- M\. Sap, R\. Le Bras, D\. Fried, and Y\. Choi \(2022\) Neural theory\-of\-mind? on the limits of social intelligence in large lms\. In Proceedings of the 2022 conference on empirical methods in natural language processing, pp\. 3762–3780\.
- M\. Sap, H\. Rashkin, D\. Chen, R\. Le Bras, and Y\. Choi \(2019\) Social iqa: commonsense reasoning about social interactions\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 4463–4473\.
- M\. Sclar, J\. Dwivedi\-Yu, M\. Fazel\-Zarandi, Y\. Tsvetkov, Y\. Bisk, Y\. Choi, and A\. Celikyilmaz \(2025\) Explore theory of mind: program\-guided adversarial data generation for theory of mind reasoning\. In The Thirteenth International Conference on Learning Representations\.
- M\. Sclar, S\. Kumar, P\. West, A\. Suhr, Y\. Choi, and Y\. Tsvetkov \(2023\) Minding language models’ \(lack of\) theory of mind: a plug\-and\-play multi\-character belief tracker\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Toronto, Canada, pp\. 13960–13980\.
- N\. Shapira, M\. Levy, S\. H\. Alavi, X\. Zhou, Y\. Choi, Y\. Goldberg, M\. Sap, and V\. Shwartz \(2024\) Clever hans or neural theory of mind? stress testing social reasoning in large language models\. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 2257–2273\.
- D\. Sileo and A\. Lernould \(2023\) MindGames: targeting theory of mind in large language models with dynamic epistemic modal logic\. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp\. 4570–4577\.
- C\. V\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2025\) Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\. In The Thirteenth International Conference on Learning Representations\.
- Qwen Team \(2026\) Qwen3\.5: accelerating productivity with native multimodal agents\. Note: Accessed: 2026\-05\-21\. External Links: [Link](https://qwen.ai/blog?id=qwen3.5)\.
- T\. Ullman \(2023\) Large language models fail on trivial alterations to theory\-of\-mind tasks\. External Links: 2302\.08399, [Link](https://arxiv.org/abs/2302.08399)\.
- H\. P\. van Ditmarsch, W\. van der Hoek, and B\. Kooi \(2007\) Dynamic epistemic logic\. Synthese Library, Vol\. 337, Springer, Berlin, Heidelberg\. External Links: [Document](https://dx.doi.org/10.1007/978-1-4020-5839-4), ISBN 978\-1\-4020\-5838\-7\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\) Self\-consistency improves chain of thought reasoning in language models\. In The Eleventh International Conference on Learning Representations\. External Links: [Link](https://openreview.net/forum?id=1PL1NIMMrw)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA\.
- A\. Wilf, S\. Lee, P\. P\. Liang, and L\. Morency \(2024\) Think twice: perspective\-taking improves large language models’ theory\-of\-mind capabilities\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Bangkok, Thailand, pp\. 8292–8308\.
- H\. Wimmer and J\. Perner \(1983\) Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception\. Cognition 13\(1\), pp\. 103–128\.
- Y\. Wu, Y\. He, Y\. Jia, R\. Mihalcea, Y\. Chen, and N\. Deng \(2023\) Hi\-ToM: a benchmark for evaluating higher\-order theory of mind reasoning in large language models\. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, pp\. 10691–10706\.
- Y\. Wu, J\. Xie, D\. Zhang, and Z\. Xu \(2025\) DEL\-tom: inference\-time scaling for theory\-of\-mind reasoning via dynamic epistemic logic\. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, pp\. 11383–11397\. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.573), [Link](https://aclanthology.org/2025.emnlp-main.573/)\.
- H\. Xu, S\. Qi, J\. Li, Y\. Zhou, J\. Du, C\. Catmur, and Y\. He \(2025\) EnigmaToM: improve LLMs’ theory\-of\-mind reasoning capabilities with neural knowledge base of entity states\. In Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, pp\. 13598–13622\.
- H\. Xu, R\. Zhao, L\. Zhu, J\. Du, and Y\. He \(2024\) OpenToM: a comprehensive benchmark for evaluating theory\-of\-mind reasoning capabilities of large language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Bangkok, Thailand, pp\. 8593–8623\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. R\. Narasimhan \(2023\) Tree of thoughts: deliberate problem solving with large language models\. In Thirty\-seventh Conference on Neural Information Processing Systems\.
## Appendix
## Appendix A Related Work
Early ToM benchmarks adapted classic false\-belief paradigms into text\-based question answering\. ToMi \(Le et al\., [2019](https://arxiv.org/html/2606.11724#bib.bib30)\), for example, evaluates whether models can distinguish reality from an agent’s belief in Sally–Anne\-style narratives\. SocialIQA \(Sap et al\., [2019](https://arxiv.org/html/2606.11724#bib.bib2)\) broadened the scope from belief tracking to everyday social commonsense, including intents, reactions, and social consequences\. However, subsequent evaluations showed that LLMs often lack robust ToM\-like reasoning\. Sap et al\. \([2022](https://arxiv.org/html/2606.11724#bib.bib28)\) found that large pretrained models underperform humans on social reasoning and false\-belief tasks\. Ullman \([2023](https://arxiv.org/html/2606.11724#bib.bib3)\) further showed that seemingly minor perturbations to classic ToM scenarios can cause dramatic failures\. Moreover, Shapira et al\. \([2024](https://arxiv.org/html/2606.11724#bib.bib4)\) argued that apparent success on ToM\-style problems may reflect shallow heuristics rather than stable mental\-state reasoning\. These findings motivate more controlled, diverse, and leakage\-resistant evaluations of ToM in LLMs\.
Recent work has therefore developed richer ToM benchmarks that move beyond isolated first\-order false\-belief questions\. Big\-ToM \(Gandhi et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib29)\) uses causal templates to generate controlled scenarios involving percepts, beliefs, desires, and actions, enabling systematic tests of forward belief inference, forward action inference, and backward belief inference\. FanToM \(Kim et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib6)\) shifts evaluation from passive narratives to multiparty conversations in which characters enter and leave discussions, creating information asymmetries between participants\. Its results show that models may answer one question format correctly while failing logically related answerability or information\-access questions, revealing inconsistent performance across question types\. Hi\-ToM \(Wu et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib7)\) focuses on higher\-order recursive beliefs, extending evaluation to third\-order and fourth\-order ToM and incorporating public and private deceptive communication\. More recent benchmarks further broaden the evaluation landscape: ToMBench \(Chen et al\., [2024](https://arxiv.org/html/2606.11724#bib.bib8)\) introduces a large\-scale bilingual benchmark spanning multiple mental\-state categories and abilities; OpenToM \(Xu et al\., [2024](https://arxiv.org/html/2606.11724#bib.bib9)\) evaluates longer stories with richer psychological states; and ExploreToM \(Sclar et al\., [2025](https://arxiv.org/html/2606.11724#bib.bib10)\) uses program\-guided adversarial generation to produce challenging and diverse ToM scenarios\. Together, these benchmarks show that robust ToM evaluation requires tracking event access, information asymmetry, and higher\-order recursive beliefs\.
A parallel line of work studies how to improve LLMs’ ToM reasoning\. General\-purpose reasoning methods such as chain\-of\-thought prompting \(Wei et al\., [2022](https://arxiv.org/html/2606.11724#bib.bib11); Kojima et al\., [2022](https://arxiv.org/html/2606.11724#bib.bib12)\), self\-consistency \(Wang et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib24)\), and tree\-of\-thought search \(Yao et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib14)\) improve performance on many symbolic and mathematical reasoning tasks, but their effect on ToM is mixed\. Several ToM benchmarks report that CoT provides limited gains or can even amplify errors when models adopt an incorrect perspective or propagate a mistaken intermediate belief \(Gandhi et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib29); Kim et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib6); Wu et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib7)\)\. This suggests that ToM reasoning requires not simply longer rationales, but appropriate intermediate representations for tracking the evolution of agents’ beliefs\. SimToM \(Wilf et al\., [2024](https://arxiv.org/html/2606.11724#bib.bib15)\) addresses this by decomposing ToM into perspective\-taking and question answering: the model first filters the story to what the target character knows, and then answers from that character’s perspective\. This two\-stage decomposition substantially improves performance over standard prompting baselines, showing that explicitly constructing character perspectives is beneficial\.
Subsequent methods make this intermediate structure more explicit\. SymbolicToM \(Sclar et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib16)\) introduces a plug\-and\-play multi\-character belief tracker, arguing that ToM requires explicit symbolic representations of agents’ beliefs rather than implicit language\-model inference alone\. TimeToM \(Hou et al\., [2024](https://arxiv.org/html/2606.11724#bib.bib17)\) further emphasizes temporal structure by constructing a temporal space and per\-character Temporal Belief State Chains \(TBSCs\), distinguishing self\-world beliefs from social\-world beliefs and using shared belief\-communication periods to support higher\-order reasoning\. PercepToM \(Jung et al\., [2024](https://arxiv.org/html/2606.11724#bib.bib18)\) decomposes ToM into perception inference and perception\-to\-belief inference, showing that LLMs may identify what an agent can perceive while still failing to convert that perception into the correct belief state\. EnigmaToM \(Xu et al\., [2025](https://arxiv.org/html/2606.11724#bib.bib19)\) extends this direction with a neuro\-symbolic entity\-state memory and iterative perspective masking for higher\-order belief tracking\. These studies collectively suggest that ToM failures often arise from insufficient tracking of belief evolution: models struggle to maintain which information each character observed and how later unobserved information should or should not revise nested beliefs\.
Formal and verifier\-based approaches provide another route to structured ToM reasoning\. Dynamic epistemic logic \(DEL\) offers a principled formalism for representing belief states, event models, and belief updates in multi\-agent settings \(Baltag et al\., [1998](https://arxiv.org/html/2606.11724#bib.bib20); van Ditmarsch et al\., [2007](https://arxiv.org/html/2606.11724#bib.bib21)\)\. MindGames \(Sileo and Lernould, [2023](https://arxiv.org/html/2606.11724#bib.bib22)\) uses epistemic logic to generate controlled reasoning problems\. DEL\-ToM \(Wu et al\., [2025](https://arxiv.org/html/2606.11724#bib.bib23)\) formalizes ToM as a sequence of dynamic epistemic belief updates\. DEL\-ToM trains a Process Belief Model using labels generated by a DEL simulator and applies inference\-time scaling to select high\-scoring belief traces from multiple LLM\-generated candidates\. This connects ToM reasoning to broader work on process supervision and inference\-time scaling, where voting, search, or verifiers are used to select among candidate reasoning traces \(Wang et al\., [2023](https://arxiv.org/html/2606.11724#bib.bib24); Brown et al\., [2024](https://arxiv.org/html/2606.11724#bib.bib25); Snell et al\., [2025](https://arxiv.org/html/2606.11724#bib.bib26)\)\. This line of work highlights the importance of intermediate belief\-update supervision, yet typically relies on external simulators or verifier\-based selection rather than direct inference\-time perspective construction\. In contrast, RecToM constructs fact\-based symbolic perspectives directly at inference time, recursively deriving each character perspective from the preceding perspective to evaluate nested beliefs under asymmetric information\.
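
| Model | Method | True\-belief | False\-belief | Overall |
|---|---|---|---|---|
| GPT\-5\.4 | CoT | 98\.50 | 99\.50 | 99\.00 |
| GPT\-5\.4 | SimToM | 99\.50 | 98\.50 | 99\.00 |
| GPT\-5\.4 | TimeToM | 96\.00 | 99\.50 | 97\.75 |
| GPT\-5\.4 | RecToM | 99\.50 | 99\.50 | 99\.50 |
| Gemini\-3 | CoT | 98\.50 | 99\.00 | 98\.75 |
| Gemini\-3 | SimToM | 98\.00 | 98\.00 | 98\.00 |
| Gemini\-3 | TimeToM | 92\.00 | 93\.00 | 92\.50 |
| Gemini\-3 | RecToM | 99\.00 | 99\.00 | 99\.00 |
| Qwen3\.5 | CoT | 93\.00 | 99\.50 | 96\.25 |
| Qwen3\.5 | SimToM | 92\.50 | 97\.00 | 94\.75 |
| Qwen3\.5 | TimeToM | 84\.00 | 98\.00 | 91\.00 |
| Qwen3\.5 | RecToM | 98\.00 | 99\.00 | 98\.50 |
| Gemma\-4 | CoT | 88\.00 | 98\.50 | 93\.25 |
| Gemma\-4 | SimToM | 78\.50 | 97\.50 | 88\.00 |
| Gemma\-4 | TimeToM | 69\.00 | 91\.00 | 80\.00 |
| Gemma\-4 | RecToM | 98\.50 | 99\.50 | 99\.00 |

Table 5: Accuracy \(%\) on Big\-ToM forward\-belief questions\. Big\-ToM contains 400 forward\-belief instances, with 200 true\-belief and 200 false\-belief instances\. The best result for each backbone and metric is in bold, and the second\-best result is underlined\.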
## Appendix B KD45 Proof for RecToM
We provide the full derivation of the KD45 axioms for the belief modality induced by RecToM’s recursive perspective construction\. The proof has three steps\. First, we define the language of belief queries\. Second, we define how such queries are evaluated by RecToM\. Third, we show that this evaluation satisfies the $K$, $D$, $4$, and $5$ axioms of KD45\.
Let $\mathcal{W}$ be the set of well\-formed perspectives\. In this proof, we simplify the notation and write $P$ for an arbitrary perspective, e\.g\., $P\in\mathcal{W}$\. We assume that $\mathcal{W}$ is closed under RecToM’s perspective construction: for each character $a_{i}$, the constructed perspective $P^{(a_{i}\mid P)}$ also belongs to $\mathcal{W}$\. For readability, define $F_{a_{i}}(P)=P^{(a_{i}\mid P)}$\. Thus, $F_{a_{i}}$ maps any source perspective $P$ to the constructed perspective $P^{(a_{i}\mid P)}$ of character $a_{i}$\.
Given an atomic symbolic fact $x$ and a set of characters $\mathcal{A}$, we define belief\-query formulas using the grammar

$$\varphi,\psi ::= x \mid \neg\varphi \mid (\varphi\rightarrow\psi) \mid B_{a_{j}}\varphi, \quad a_{j}\in\mathcal{A}.$$

Here, $B_{a_{j}}\varphi$ means that character $a_{j}$ believes $\varphi$\. This grammar includes only negation, implication, and belief as primitive operators\. Other Boolean connectives can be introduced as abbreviations in the usual classical way; for example, $\varphi\vee\psi$ abbreviates $\neg\varphi\rightarrow\psi$, and $\varphi\wedge\psi$ abbreviates $\neg(\varphi\rightarrow\neg\psi)$\. Thus, the grammar remains compact while still allowing standard Boolean combinations of belief queries\.
In the following proof, we fix an arbitrary character $a_{i}$ and show that $B_{a_{i}}$ satisfies the KD45 axioms\. We write $P\models\varphi$ to mean that formula $\varphi$ is satisfied, or evaluates to true, under perspective $P$\. For an atomic symbolic fact $x$, $P\models x$ iff $x$ holds in the final state of $P$; if $x$ is absent from the final state, it is treated as false under the closed\-world assumption\. The primitive Boolean connectives are interpreted classically: $P\models\neg\varphi$ iff $P\not\models\varphi$, and $P\models\varphi\rightarrow\psi$ iff either $P\not\models\varphi$ or $P\models\psi$\. Connectives such as $\wedge$ and $\vee$, when used, inherit their meanings through the abbreviations defined above\. The belief operator is interpreted by switching from the current perspective to the constructed perspective of the queried character:

$$P\models B_{a_{i}}\varphi \quad\text{iff}\quad F_{a_{i}}(P)\models\varphi. \tag{5}$$

Thus, evaluating $B_{a_{i}}\varphi$ in $P$ means: first construct $a_{i}$’s perspective, and then evaluate $\varphi$ inside that perspective\. Nested belief queries are handled by repeatedly applying the same rule\.
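The evaluation rule in Eq\. \(5\) can be sketched as a small recursive evaluator\. This is an illustration under simplifying assumptions rather than the paper's implementation: a perspective is reduced to the set of facts true in its final state, the construction $F_{a_{i}}$ is passed in as an arbitrary function, and `toy_construct`, the fact names, and the two\-character story are hypothetical stand\-ins\.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    sub: "Formula"


@dataclass(frozen=True)
class Implies:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Believes:
    agent: str
    sub: "Formula"


Formula = Union[Atom, Not, Implies, Believes]
Perspective = FrozenSet[str]  # assumption: only the facts of the final state
Construct = Callable[[str, Perspective], Perspective]  # plays the role of F_a


def holds(p: Perspective, f: Formula, construct: Construct) -> bool:
    """Evaluate P |= f following the semantics around Eq. (5)."""
    if isinstance(f, Atom):
        return f.name in p  # closed world: absent facts are false
    if isinstance(f, Not):
        return not holds(p, f.sub, construct)
    if isinstance(f, Implies):
        return (not holds(p, f.left, construct)) or holds(p, f.right, construct)
    if isinstance(f, Believes):
        # Switch to the believer's constructed perspective, then evaluate inside it.
        return holds(construct(f.agent, p), f.sub, construct)
    raise TypeError(f"unsupported formula: {f!r}")


# Hypothetical Sally-Anne setup: the ball was moved while Sally was away.
ACTUAL = frozenset({"ball_in_box"})
STALE = frozenset({"ball_in_basket"})


def toy_construct(agent: str, p: Perspective) -> Perspective:
    """Stand-in for F_a: Sally missed the move, Anne observed everything."""
    return STALE if agent == "Sally" else p


second_order = Believes("Anne", Believes("Sally", Atom("ball_in_basket")))
print(holds(ACTUAL, second_order, toy_construct))
```

The second\-order query is answered by two perspective switches followed by an ordinary fact lookup, which mirrors how nested belief questions reduce to actual\-world questions in the innermost perspective\.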
##### Self\-perspective idempotence\.
The key property is that constructing the same character’s perspective twice does not change the result:

$$F_{a_{i}}(F_{a_{i}}(P)) = F_{a_{i}}(P). \tag{6}$$

This follows from the construction rule in Eq\. \(2\)\. Once $F_{a_{i}}(P)$ has already been constructed, the event sequence contains only events observable to $a_{i}$ or $\emptyset$, and every unobservable state has already been completed by inheriting the previous belief state and revising it only with observable evidence\. Therefore, reapplying the same construction does not remove any additional information or add any new information\. It preserves the same observable event sequence and returns the same completed state sequence\.
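Idempotence can be checked on a toy model of perspective construction\. The sketch below is a deliberately simplified assumption, not the construction rule of Eq\. \(2\): each event carries a fixed set of witnesses, unobserved events are replaced by `None` \(standing in for $\emptyset$\), and states are recovered by replaying the remaining events deterministically\. In the paper, observability is instead evaluated from character presence and location facts; the names `Event`, `Perspective`, `construct`, and `states` are hypothetical\.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Tuple

Facts = Tuple[Tuple[str, str], ...]  # fact updates as (key, value) pairs


class Event(NamedTuple):
    witnesses: FrozenSet[str]  # assumption: observers are given explicitly
    effects: Facts


class Perspective(NamedTuple):
    initial: Facts
    events: Tuple[Optional[Event], ...]


def construct(agent: str, p: Perspective) -> Perspective:
    """Toy F_a: keep events the agent witnesses, replace the rest with None."""
    kept = tuple(
        e if e is not None and agent in e.witnesses else None for e in p.events
    )
    return Perspective(p.initial, kept)


def states(p: Perspective) -> List[Dict[str, str]]:
    """Replay events; an empty event inherits the previous state unchanged."""
    state = dict(p.initial)
    out = []
    for e in p.events:
        if e is not None:
            state.update(e.effects)
        out.append(dict(state))
    return out


world = Perspective(
    initial=(("ball", "basket"),),
    events=(
        Event(frozenset({"Sally", "Anne"}), (("sally", "outside"),)),
        Event(frozenset({"Anne"}), (("ball", "box"),)),  # Sally does not see this
    ),
)

sally = construct("Sally", world)
print(construct("Sally", sally) == sally)  # second application changes nothing
print(states(world)[-1]["ball"], states(sally)[-1]["ball"])
```

In this toy setting the second application is a no\-op because every surviving event is already one the character witnessed, which is the same intuition the argument above gives for Eq\. \(6\)\.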
### B\.1 KD45 Axiom Satisfaction
We now verify the KD45 axioms under the belief semantics in Eq\. \([5](https://arxiv.org/html/2606.11724#A2.E5)\)\. The main idea is simple: $K$ and $D$ follow from ordinary classical reasoning inside the constructed perspective $F_{a_{i}}(P)$, while $4$ and $5$ follow from self\-perspective idempotence\.
##### K: Distribution\.
The $K$ axiom states $B_{a_i}(\varphi \rightarrow \psi) \rightarrow (B_{a_i}\varphi \rightarrow B_{a_i}\psi)$. Intuitively, this axiom says that belief is closed under implication: if a character's constructed perspective supports an implication and also supports its premise, then it should also support the conclusion. To verify this for RecToM, suppose $P \models B_{a_i}(\varphi \rightarrow \psi)$ and $P \models B_{a_i}\varphi$. By Eq. ([5](https://arxiv.org/html/2606.11724#A2.E5)), both formulas are evaluated inside the same constructed perspective $F_{a_i}(P)$, so $F_{a_i}(P) \models \varphi \rightarrow \psi$ and $F_{a_i}(P) \models \varphi$. Since implication is interpreted classically inside this perspective, $F_{a_i}(P) \models \psi$. Applying Eq. ([5](https://arxiv.org/html/2606.11724#A2.E5)) again gives $P \models B_{a_i}\psi$. Therefore, $K$ holds.
##### D: Consistency\.
The $D$ axiom states $B_{a_i}\varphi \rightarrow \neg B_{a_i}\neg\varphi$. Intuitively, this axiom says that a character's constructed perspective should not support both a statement and its negation. To verify this for RecToM, suppose $P \models B_{a_i}\varphi$. By Eq. ([5](https://arxiv.org/html/2606.11724#A2.E5)), this means $F_{a_i}(P) \models \varphi$. Since $F_{a_i}(P)$ is a well-formed perspective and formulas are evaluated with classical negation, $F_{a_i}(P) \not\models \neg\varphi$. Hence $P \not\models B_{a_i}\neg\varphi$, so $P \models \neg B_{a_i}\neg\varphi$. Therefore, $D$ holds.
##### 4: Positive introspection\.
The $4$ axiom states $B_{a_i}\varphi \rightarrow B_{a_i}B_{a_i}\varphi$. Intuitively, this axiom says that if a character's constructed perspective supports $\varphi$, then recursively asking what the same character believes should not change that perspective. To verify this for RecToM, suppose $P \models B_{a_i}\varphi$. Then $F_{a_i}(P) \models \varphi$. By self-perspective idempotence, applying the same construction again gives $F_{a_i}(F_{a_i}(P)) = F_{a_i}(P)$. Thus, when $B_{a_i}\varphi$ is evaluated inside $F_{a_i}(P)$, it returns to the same constructed perspective where $\varphi$ already holds. Hence $F_{a_i}(P) \models B_{a_i}\varphi$, and by Eq. ([5](https://arxiv.org/html/2606.11724#A2.E5)), $P \models B_{a_i}B_{a_i}\varphi$. Therefore, $4$ holds.
##### 5: Negative introspection\.
The $5$ axiom states $\neg B_{a_i}\varphi \rightarrow B_{a_i}\neg B_{a_i}\varphi$. Intuitively, this axiom says that if $\varphi$ is not supported by a character's constructed perspective, then recursively querying that same character's belief should not make $\varphi$ become supported. To verify this for RecToM, suppose $P \models \neg B_{a_i}\varphi$. Then $F_{a_i}(P) \not\models \varphi$. By self-perspective idempotence, applying the same construction again still yields the same perspective, so $F_{a_i}(F_{a_i}(P)) \not\models \varphi$. Therefore, inside $F_{a_i}(P)$, the character does not believe $\varphi$, i.e., $F_{a_i}(P) \models \neg B_{a_i}\varphi$. Applying Eq. ([5](https://arxiv.org/html/2606.11724#A2.E5)) again gives $P \models B_{a_i}\neg B_{a_i}\varphi$. Therefore, $5$ holds.
Since the belief operator $B_{a_i}$ satisfies $K$, $D$, $4$, and $5$, RecToM's recursive perspective construction induces a KD45 belief modality.
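The two introspection arguments can also be written as chains of equivalences. The derivation below is a restatement that uses only the belief semantics in Eq. (5), self-perspective idempotence in Eq. (6), and classical negation; read this way, both introspection axioms appear to hold as biconditionals under these semantics.

```latex
\begin{align*}
P \models B_{a_i} B_{a_i}\varphi
  &\iff F_{a_i}(P) \models B_{a_i}\varphi          && \text{Eq. (5)} \\
  &\iff F_{a_i}(F_{a_i}(P)) \models \varphi        && \text{Eq. (5)} \\
  &\iff F_{a_i}(P) \models \varphi                 && \text{Eq. (6)} \\
  &\iff P \models B_{a_i}\varphi                   && \text{Eq. (5)} \\[4pt]
P \models B_{a_i} \neg B_{a_i}\varphi
  &\iff F_{a_i}(P) \models \neg B_{a_i}\varphi     && \text{Eq. (5)} \\
  &\iff F_{a_i}(P) \not\models B_{a_i}\varphi      && \text{classical negation} \\
  &\iff F_{a_i}(F_{a_i}(P)) \not\models \varphi    && \text{Eq. (5)} \\
  &\iff F_{a_i}(P) \not\models \varphi             && \text{Eq. (6)} \\
  &\iff P \not\models B_{a_i}\varphi               && \text{Eq. (5)} \\
  &\iff P \models \neg B_{a_i}\varphi              && \text{classical negation}
\end{align*}
```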
## Appendix C Parameter Settings
RecToM operates on all LLMs through API access. For proprietary models, GPT-5.4 and Gemini-3 Flash, we use the default parameter settings. For open-source models, we set temperature to 1.0 and top-$p$ to 0.95 for both Qwen3.5-27B and Gemma-4-26B-A4B, with top-$k$ set to 20 for Qwen3.5-27B and 64 for Gemma-4-26B-A4B to reduce repetitive outputs and filter rare tokens while preserving generation diversity. All experiments are conducted on a virtual machine with four NVIDIA A100 80GB GPUs. Our code will be released in the camera-ready version.
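For reference, the open-source sampling settings above could be recorded as a small lookup table. This is only a sketch: the key names `temperature`, `top_p`, and `top_k` follow common API conventions and are assumptions, as is the hypothetical `sampling_params` helper; the numeric values are those stated in the text.

```python
from typing import Dict, Union

Number = Union[int, float]

# Values taken from the text; key names are assumed, not confirmed by the source.
OPEN_SOURCE_SAMPLING: Dict[str, Dict[str, Number]] = {
    "Qwen3.5-27B": {"temperature": 1.0, "top_p": 0.95, "top_k": 20},
    "Gemma-4-26B-A4B": {"temperature": 1.0, "top_p": 0.95, "top_k": 64},
}


def sampling_params(model: str) -> Dict[str, Number]:
    """Return sampling overrides for a model.

    Models not listed (for example the proprietary ones) get an empty dict,
    meaning the provider defaults are left untouched.
    """
    return dict(OPEN_SOURCE_SAMPLING.get(model, {}))
```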
## Appendix D Benchmark Details
### D.1 Big-ToM
Big-ToM (Gandhi et al., [2023](https://arxiv.org/html/2606.11724#bib.bib29)), a GPT-4-generated benchmark, evaluates belief reasoning in natural-language stories based on the Sally–Anne false-belief paradigm. We use the forward-belief subset, which asks models to infer what a character believes after the character either observes or misses a belief-relevant event. This subset focuses on first-order true-belief and false-belief questions, evaluating whether a character's belief is consistent or inconsistent with reality. The task is formulated as binary multiple choice, with a random baseline accuracy of 50%. Figure [3](https://arxiv.org/html/2606.11724#A4.F3) shows an example true-belief instance from the Big-ToM forward-belief subset.
### D.2 FanToM
FanToM (Kim et al., [2023](https://arxiv.org/html/2606.11724#bib.bib6)) evaluates ToM reasoning in multi-party dialogue settings. Its dialogues introduce asymmetric information by allowing characters to enter and leave while the conversation continues, resulting in different characters observing different parts of the conversation. We focus on FanToM belief questions, including first-order and second-order questions, which align with the scope of this work. These questions are formulated as binary multiple-choice tasks, with a random baseline accuracy of 50%. Compared with narrative benchmarks, FanToM dialogues are longer and involve more characters and subtopics, requiring models to integrate extended dialogue context and maintain character-specific beliefs. Figure [4](https://arxiv.org/html/2606.11724#A5.F4) illustrates the FanToM dialogue structure with first-order and second-order belief questions.
Figure 3: An example Big-ToM instance with the first-order belief question over a natural-language event sequence.
## Appendix E Prompt Templates for RecToM
Tables [6](https://arxiv.org/html/2606.11724#A5.T6)–[10](https://arxiv.org/html/2606.11724#A5.T10) present the main prompt templates used by RecToM on Hi-ToM. They cover fact-based event abstraction, state and event observability identification, transient-event belief revision, and final answer inference. For readability, we show compact templates; the complete prompts are provided in the released code.
Figure 4: An example FanToM instance with first-order and second-order belief questions over a dialogue sequence.

Prompt for Fact-based Event Abstraction in RecToM

You are extracting step-wise symbolic deltas from a ToM story. Return valid JSON only.

Example output schema:

```
{
"characters": ["Avery", "Charlotte"],
"steps": [
{
"step_index": 1,
"step_text": "Avery entered the living_room.",
"event_type": "persistent",
"added_facts": ["in_room(Avery,living_room)"],
"removed_facts": []
}
]
}
```
Important definition.

- Do not output a full state for each step. Output only the delta for each step.
- `event_type` is either `persistent` or `transient`. Persistent events introduce or revise state facts; transient events record communication, claims, or questions.
- `added_facts` are facts that become true because of the current step.
- `removed_facts` are facts that stop being true because of the current step.

Extraction guidelines.

- Include every story step exactly once and include all human characters appearing in the story.
- Extract concise symbolic facts for belief-relevant changes, such as character locations, object locations, communication events, and stated claims.
- Remove a fact only when the current step makes it false.
- Distinguish spoken claims from actual world facts when communication events appear.
- Represent private communication as `private_tell(speaker,listener,proposition)` and public communication as `public_claim(speaker,proposition)`.
- If the spoken content is an object-location statement, represent the proposition as `in(object,container)`.

Story steps: `{story_steps}`

Assumptions: `{assumptions}`

Table 6: Prompt template used by RecToM for fact-based event abstraction. The LLM extracts symbolic deltas and classifies each event as persistent or transient. The complete state sequence is then computed externally using the deterministic update rule.

Prompt for State Observability Identification in RecToM

You are deciding whether a target character is present in each aligned source state. Return valid JSON only.

Example output schema:

```
{
"character": "Alice",
"observation_basis": [
"at(Alice,kitchen)",
"at(Alice,kitchen)",
"not_observable"
]
}
```
Task.

- Return exactly one string for each aligned source state.
- The $n$th string corresponds to the $n$th aligned source state.
- Each output string should be either a fact indicating the target character's location or `not_observable`.

Decision rule.

- If the aligned source state contains a fact indicating that the target character is present in a location, return that fact.
- Presence may be expressed by facts such as `at(character,location)` or `in_room(character,location)`.
- If no fact indicates the target character's presence, return `not_observable`.
- Do not return bare location names.

Target character: `{target_character}`

Aligned source states: `{aligned_source_states}`

Table 7: Prompt template used by RecToM for state observability identification. Given aligned source states, the LLM derives the observation basis for the target character by identifying explicit presence facts or returning `not_observable`.

Prompt for Event Observability Identification in RecToM

You are deciding whether a target character can observe each aligned event facts. Return valid JSON only.

Example output schema:

```
{
"character": "Alice",
"observation_basis": [
"-in_room(Isabella,living_room)",
"private_tell(Alice,Bob,in(ball,bedroom))",
"public_claim(Bob,in(ball,kitchen))",
"not_observable"
]
}
```
Task.

- Return exactly one string for each aligned event facts.
- The $n$th string corresponds to the $n$th aligned event facts.
- Each output string should be either the exact observable event fact or `not_observable`.

Decision rule.

- If the aligned event is empty, return `not_observable`.
- For private communication, such as `private_tell(speaker,listener,proposition)`, only the speaker and listener can observe the event.
- For public communication, such as `public_claim(speaker,proposition)`, all characters can observe the event.
- For room-entry or room-exit facts, such as `+in_room(character,room)` or `-in_room(character,room)`, the event is observable to all characters.
- Do not rewrite or normalize the event fact; return it exactly as it appears in the aligned event facts.

Target character: `{target_character}`

Aligned event facts: `{aligned_event_facts}`

Table 8: Prompt template used by RecToM for event observability identification. Given aligned event facts, the LLM derives the observation basis for the target character by identifying observable communication events and room-transition events, returning either the exact observable facts or `not_observable`.

Prompt for Transient-Event Belief Revision in RecToM

You are applying an observable transient event to a symbolic belief state. Return valid JSON only.

Example output schema:

```
{
"state_before": [
"in(lettuce,green_drawer)",
"in_room(Avery,living_room)"
],
"action": [
"public_claim(Isabella,in(lettuce,green_bathtub))"
],
"action_added_facts": [
"in(lettuce,green_bathtub)"
],
"action_removed_facts": [
"in(lettuce,green_drawer)"
],
"state_after": [
"in(lettuce,green_bathtub)",
"in_room(Avery,living_room)"
]
}
```
Task.

- Start from the given state.
- Apply the observable transient event to this state.
- Output the facts added or removed by the event.
- Output the updated state after applying the event.

Revision rule.

- The transient event may represent communication, claims, questions, or other belief-relevant actions.
- If the event contains an object-location fact, use the fact to revise the current state.
- Preserve state facts that are not affected by the event.

State: `{state}`

Observable transient event: `{observable_transient_event}`

Table 9: Prompt template used by RecToM for transient-event belief revision. Given the current belief state and an observable transient event, the LLM outputs the event-induced added and removed facts and the resulting updated state.

Prompt for Answer Inference in RecToM

You are answering a multiple-choice Theory-of-Mind question. Return valid JSON only.

Example output schema:

```
{
"reasoning_summary": "short summary",
"predicted_answer": "C"
}
```
Task.

- Use the provided state facts to answer the question.
- Select exactly one option from the candidate choices.
- The selected answer should correspond to the facts in the given state.

Output constraints.

- `predicted_answer` must contain only the option letter, such as `A`, `B`, or `C`.
- Do not output the choice text in `predicted_answer`.
- Do not include additional characters or punctuation in `predicted_answer`.

State: `{state}`

Question: `{question}`

Choices: `{choices}`

Table 10: Prompt template used by RecToM for answer selection. Given the final state from the constructed perspective, the LLM selects one candidate option and returns the predicted answer as an option letter.
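The Table 6 caption notes that the complete state sequence is computed externally from the extracted deltas. A minimal sketch of such a step is given below, assuming the JSON schema from the event-abstraction template. The `state_sequence` function is hypothetical, and its update rule (remove `removed_facts`, then add `added_facts`, for every step regardless of `event_type`, starting from an empty state) is an assumption that may differ from the deterministic update rule actually used.

```python
import json
from typing import FrozenSet, List


def state_sequence(abstraction_json: str) -> List[FrozenSet[str]]:
    """Turn step-wise deltas into one full state per story step."""
    data = json.loads(abstraction_json)
    state: FrozenSet[str] = frozenset()  # assumed empty initial state
    states: List[FrozenSet[str]] = []
    # Process steps in story order, as given by step_index.
    for step in sorted(data["steps"], key=lambda s: s["step_index"]):
        removed = frozenset(step["removed_facts"])
        added = frozenset(step["added_facts"])
        state = (state - removed) | added  # assumed rule: remove first, then add
        states.append(state)
    return states


# Illustrative input in the template's schema; the story itself is made up.
example = json.dumps({
    "characters": ["Avery"],
    "steps": [
        {"step_index": 1, "step_text": "Avery entered the living_room.",
         "event_type": "persistent",
         "added_facts": ["in_room(Avery,living_room)"], "removed_facts": []},
        {"step_index": 2, "step_text": "Avery moved the ball to the box.",
         "event_type": "persistent",
         "added_facts": ["in(ball,box)"], "removed_facts": []},
        {"step_index": 3, "step_text": "Avery exited the living_room.",
         "event_type": "persistent",
         "added_facts": [], "removed_facts": ["in_room(Avery,living_room)"]},
    ],
})
states = state_sequence(example)
```

Keeping this step outside the LLM means each full state is derived mechanically from the deltas, so the model only has to report what changed at each step.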