OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Summary
OmniToM introduces a benchmark that evaluates large language models' theory of mind by requiring explicit belief structure extraction and labeling, revealing a bottleneck in tracking actor-specific beliefs despite strong performance on endpoint QA tasks.
# OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
Source: [https://arxiv.org/html/2605.26322](https://arxiv.org/html/2605.26322)
Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah

University of Central Florida, Orlando, Florida, USA

adam.bawatneh@ucf.edu
###### Abstract
Theory of Mind (ToM), the ability to infer others’ knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using endpoint question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. To address this gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor’s mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages. Stage 1, Belief Extraction, extracts from the story the beliefs relevant to its social dynamics; Stage 2, Belief Labeling, assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. OmniToM is built from 895 stories drawn from the existing ToMBench story corpus, augmented with 22,343 labeled belief propositions produced by a human-calibrated, LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors’ beliefs and shared mental states.
## 1 Introduction
Figure 1: Comparison of evaluation paradigms on a false-belief story. Alice and Bob are in a room with a box and a basket; Alice places a ball in the box, Bob leaves, and Alice then moves the ball to the basket. Top: Endpoint Question Answering (QA) evaluates only the final answer to “Where will Bob look?”; a model may answer correctly (“Box”), while the supporting mental-state representation remains unobserved. Bottom: Mental-State Representation via Beliefs represents the same story as world facts and actor beliefs over time, making the hidden reasoning process visible. The gold trace preserves Bob’s outdated belief after the hidden transfer: Bob still believes the ball is in the box. The flawed trace illustrates the tracking error endpoint QA can hide, where the model incorrectly tracks Bob’s belief as being in the basket after an event he did not observe.

Social reasoning requires tracking how information is distributed across actors, not only what happened in the world. To predict what a person will do, a model must represent what each actor knows, infers, intends, or falsely believes. This capacity is commonly studied as Theory of Mind (ToM), the ability to attribute mental states such as beliefs, intentions, and emotions to oneself and others \[Premack and Woodruff, [1978](https://arxiv.org/html/2605.26322#bib.bib13), Beaudoin et al., [2020](https://arxiv.org/html/2605.26322#bib.bib14)\]. For large language model (LLM) evaluation, the key question is therefore not merely whether a model can answer a social-reasoning question, but whether it recovers the supporting multi-actor mental-state representation needed for robust ToM reasoning.
Existing LLM ToM benchmarks usually test this ability indirectly through endpoint question answering (QA): given a story or dialogue, the model is scored by whether it produces the correct final answer \[Le et al., [2019](https://arxiv.org/html/2605.26322#bib.bib10), Kim et al., [2023](https://arxiv.org/html/2605.26322#bib.bib17), Wu et al., [2023](https://arxiv.org/html/2605.26322#bib.bib18), Xu et al., [2024](https://arxiv.org/html/2605.26322#bib.bib16), Chen et al., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]. As illustrated in Fig. [1](https://arxiv.org/html/2605.26322#S1.F1), endpoint correctness leaves the underlying mental-state representation unobserved. A model may answer a false-belief question correctly while failing to represent the actor-specific mental states that make the answer valid, including what each actor takes to be true, infers, remembers, or attributes to others. Recent work has therefore raised concerns that endpoint ToM scores may reflect benchmark artifacts, shortcut strategies, or task scaffolding rather than robust mental-state tracking \[Le et al., [2019](https://arxiv.org/html/2605.26322#bib.bib10), Sclar et al., [2023](https://arxiv.org/html/2605.26322#bib.bib11), Pi et al., [2024](https://arxiv.org/html/2605.26322#bib.bib15)\].
Endpoint QA also limits fine-grained analysis. ToM is not a single monolithic ability: it involves interacting aspects of mental-state reasoning, including recursive belief attribution, factual alignment, information access, pragmatic inference, intentions, emotions, and temporal belief change \[Beaudoin et al., [2020](https://arxiv.org/html/2605.26322#bib.bib14), Wimmer and Perner, [1983](https://arxiv.org/html/2605.26322#bib.bib1), Perner and Wimmer, [1985](https://arxiv.org/html/2605.26322#bib.bib3), Flavell et al., [1986](https://arxiv.org/html/2605.26322#bib.bib2), Happé, [1994](https://arxiv.org/html/2605.26322#bib.bib4), Goodman and Stuhlmüller, [2013](https://arxiv.org/html/2605.26322#bib.bib6)\]. Existing benchmarks provide valuable coverage of particular ToM settings, including false belief \[Le et al., [2019](https://arxiv.org/html/2605.26322#bib.bib10)\], higher-order reasoning \[Wu et al., [2023](https://arxiv.org/html/2605.26322#bib.bib18)\], dialogue-based information asymmetry \[Kim et al., [2023](https://arxiv.org/html/2605.26322#bib.bib17)\], perspective taking \[Xu et al., [2024](https://arxiv.org/html/2605.26322#bib.bib16)\], and broader multi-task ToM evaluation \[Chen et al., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]. However, when these abilities are evaluated primarily through endpoint answers or task-specific outputs, it remains difficult to analyze how different aspects of the underlying mental-state representation interact, for example, whether a model failed because it missed a relevant actor, updated the wrong belief, misidentified who had access to information, or mislabeled the source or content of a mental state. This leaves a central evaluation gap: existing benchmarks can test whether a model selects the correct endpoint answer, but they do not directly test whether the model recovered the multi-actor mental-state representation that makes the answer socially meaningful.
Therefore, we introduce OmniToM, a benchmark designed to address this research gap through explicit belief-structure modeling. OmniToM operationalizes the supporting mental-state representation as actor-specific belief propositions: minimal statements of what an actor takes to be true about the world or another actor’s mental state. This formulation provides a common format for analyzing knowledge, intentions, emotions, false beliefs, and nested mental states without reducing ToM to a single endpoint answer. OmniToM evaluates this representation in two stages. In Stage 1: Belief Extraction, a model extracts from the story the belief propositions relevant to its social dynamics. In Stage 2: Belief Labeling, the model labels each belief proposition under a unified seven-dimensional schema grounded in ATOMS (*Abilities in Theory of Mind Space*), a literature-derived taxonomy of ToM abilities \[Beaudoin et al., [2020](https://arxiv.org/html/2605.26322#bib.bib14)\]. ATOMS guides the range of mental-state reasoning that OmniToM aims to model. OmniToM operationalizes this coverage through seven belief-level dimensions: recursive belief depth, e.g., “Bob believes Alice thinks $X$” (Order); alignment with story reality (Truth Status); who can access or share the information (Knowledge Access); stated versus inferred content (Representation); belief subject matter (Content Type); acquisition source (Mental Source); and whether any special framing condition applies (Context).
OmniToM is built from 895 stories derived from ToMBench, a prior ToM benchmark, and augmented with 22,343 labeled belief propositions. Its construction was supported by over 1K person-hours of human annotation effort for benchmark development, before a human-calibrated LLM-assisted annotation pipeline was fixed and scaled to the full benchmark. Across diverse open- and closed-source models under zero-shot evaluation, Stage 2 belief-labeling accuracy reaches 85.95%, while Stage 1 extraction $F_1$ peaks at 57.69%. More importantly, OmniToM localizes the same bottleneck across both stages: Stage 1 performance drops when story facts must be assigned to actor-specific beliefs, and Stage 2 errors concentrate on Knowledge Access and Representation. This suggests that current LLMs struggle not simply to parse social stories, but to track which information each actor has, how it is communicated or inferred, and how it becomes part of that actor’s mental-state representation.
Our core contributions are threefold:
- We introduce OmniToM, a benchmark of 895 ToMBench-derived stories and 22,343 labeled belief propositions, developed through over 1K person-hours of human annotation effort to evaluate multi-actor mental-state representations beyond endpoint answers.
- We introduce an ATOMS-grounded belief-level schema for fine-grained ToM analysis, translating task-level ability coverage into seven per-proposition dimensions: Order, Truth Status, Knowledge Access, Representation, Content Type, Mental Source, and Context.
- We evaluate diverse open- and closed-source LLMs and find an actor-specific information-tracking bottleneck: models struggle to determine which story facts each actor knows, shares, or infers, and how those facts become beliefs.
## 2 Related Work
Table 1: Comparison with representative ToM benchmarks for language models. We compare each benchmark’s evaluation format and whether it explicitly evaluates the seven OmniToM schema dimensions as separate metrics. A checkmark (✓) indicates explicit evaluation; a dash (–) indicates that the dimension is absent or only implicit.

#### Endpoint QA and mental-state representations.
LLM benchmarks for Theory of Mind (ToM) predominantly evaluate social reasoning through endpoint question answering (QA): a model reads a story or dialogue and is scored by whether it returns the correct final answer \[Le et al., [2019](https://arxiv.org/html/2605.26322#bib.bib10), Wu et al., [2023](https://arxiv.org/html/2605.26322#bib.bib18), Kim et al., [2023](https://arxiv.org/html/2605.26322#bib.bib17), Xu et al., [2024](https://arxiv.org/html/2605.26322#bib.bib16), Chen et al., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]. As illustrated in Fig. [1](https://arxiv.org/html/2605.26322#S1.F1), endpoint QA can leave the supporting mental-state representation unobserved: a model may answer correctly without tracking what each actor takes to be true, remembers, infers, intends, feels, or attributes to others. Recent work has begun to make such intermediate reasoning more explicit. SymbolicToM introduces a multi-character belief tracker, but uses it as a scaffold for improving downstream QA rather than as the primary benchmark target \[Sclar et al., [2023](https://arxiv.org/html/2605.26322#bib.bib11)\]. Perceptions-to-Beliefs evaluates a narrower precursor pathway, asking whether models can infer what characters perceive and convert those perceptions into beliefs using perception annotations added to ToMi and FANToM \[Jung et al., [2024](https://arxiv.org/html/2605.26322#bib.bib12)\]. This shows that models may identify perceptual access while still struggling to infer the beliefs that follow from it. OmniToM targets a broader representation: the full set of actor-specific belief propositions relevant to a story’s social dynamics, including beliefs grounded in perception, memory, testimony, inference, imagination, and higher-order attribution.
#### Ability spaces and schema\-guided analysis\.
Evaluating mental-state representations requires analyzing which aspects of those representations succeed or fail. ATOMS (Abilities in Theory of Mind Space) organizes ToM measures into task-level ability categories and sub-abilities, including beliefs, knowledge, intentions, desires, emotions, percepts, and non-literal communication \[Beaudoin et al., [2020](https://arxiv.org/html/2605.26322#bib.bib14)\]. This provides a principled coverage scaffold, but not a direct label inventory for individual belief propositions. OmniToM adapts this ability-space perspective into an ATOMS-grounded belief-level schema that labels each proposition by recursive depth (Order), factual alignment (Truth Status), information sharing (Knowledge Access), explicitness (Representation), subject matter (Content Type), acquisition source (Mental Source), and framing (Context). Prior benchmarks instantiate important subsets of this space: ToMi focuses on false-belief QA, Hi-ToM on higher-order reasoning, FANToM on information asymmetry, OpenToM on physical and psychological states, and ToMBench on multi-task ToM QA \[Le et al., [2019](https://arxiv.org/html/2605.26322#bib.bib10), Wu et al., [2023](https://arxiv.org/html/2605.26322#bib.bib18), Kim et al., [2023](https://arxiv.org/html/2605.26322#bib.bib17), Xu et al., [2024](https://arxiv.org/html/2605.26322#bib.bib16), Chen et al., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]. Table [1](https://arxiv.org/html/2605.26322#S2.T1) compares these benchmarks by evaluation format and by whether they explicitly score OmniToM’s seven dimensions, highlighting OmniToM’s shift from endpoint evaluation to structured evaluation of the belief representation that supports those answers.
#### LLM\-assisted benchmark construction and evaluation\.
Constructing dense belief-structure benchmarks is costly: each story can require many actor-specific propositions, and each proposition must be labeled along multiple schema dimensions. Recent work increasingly uses LLMs to support scalable data annotation and synthesis when fully manual labeling is impractical \[Tan et al., [2024](https://arxiv.org/html/2605.26322#bib.bib29)\]. OmniToM uses LLMs in this spirit, but constrains their role through task-conditioned prompting and human calibration. In particular, we use TELeR, a prompt taxonomy for benchmarking complex tasks, to specify task-conditioned extraction, labeling, and evaluation prompts \[Karmaker Santu and Feng, [2023](https://arxiv.org/html/2605.26322#bib.bib9)\], then calibrate the resulting pipeline on a human-annotated subset before scaling to the full benchmark. Because OmniToM also uses LLM-based semantic evaluation for open-ended belief extraction, it follows prior LLM-as-a-judge work \[Zheng et al., [2023](https://arxiv.org/html/2605.26322#bib.bib30)\] while adopting the human-checking emphasis highlighted by JudgeBench \[Tan et al., [2025](https://arxiv.org/html/2605.26322#bib.bib31)\]. Thus, LLMs support benchmark construction and evaluation, but the final pipeline is task-conditioned, human-calibrated, and selected through agreement checks before use.
## 3 Benchmark Formulation
#### Formal Task Definition\.
OmniToM formulates benchmark evaluation as *explicit belief-structure modeling*: given a story, a model extracts a structured representation composed of narrated world facts and belief propositions held by all relevant actors, then labels each extracted proposition under a shared schema. Concretely (Fig. [2](https://arxiv.org/html/2605.26322#S3.F2)), Stage 1 maps the story $\mathcal{S}$ to a belief structure, $f_{ext}(\mathcal{S}) \rightarrow \mathbf{B}$, and Stage 2 labels the extracted belief propositions according to the schema, $f_{label}(\mathcal{S}, \mathbf{B}_{ab}) \rightarrow \mathbf{V}$. Thus, the benchmark consists of two linked evaluation targets: belief-structure extraction followed by belief labeling. Both stages condition on the complete story, preserving multi-actor dependencies, information asymmetries, and belief changes across the narrative.
#### Stage 1: Belief Extraction\.
Under $f_{ext}(\mathcal{S}) \rightarrow \mathbf{B}$, the extracted belief structure is $\mathbf{B}=\{(a_i,b_i,o_i)\}_{i=1}^{n}$, a structured set of propositions: narrated world facts and actor-specific belief propositions, each with an order label. Here, $a_i$ denotes either the actor (character or group) holding the belief or the special actor `world`, which marks narrated facts not attributed to any single character’s internal state. The proposition $b_i$ denotes the minimal content being represented, and $o_i\in\{0,1,2,3\}$ denotes recursive belief order, with 0 reserved for world-level facts. Following classic work on false belief and recursive belief attribution \[Wimmer and Perner, [1983](https://arxiv.org/html/2605.26322#bib.bib1), Perner and Wimmer, [1985](https://arxiv.org/html/2605.26322#bib.bib3)\], Stage 1 extracts the structure recursively: it first identifies narrated world facts, then the relevant actors, then each actor’s first-order beliefs about the world, and finally higher-order beliefs about other actors’ beliefs. This formulation focuses on the structured belief dependencies that determine each actor’s interpretation of the story. It captures belief inaccuracies, knowledge asymmetries, and complex story-based mental-state reasoning expressed in text \[Happé, [1994](https://arxiv.org/html/2605.26322#bib.bib4)\]. Temporal progression is implicit in the ordering of $\mathbf{B}$, while changes in belief state are further labeled through the schema introduced in Stage 2.
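As a minimal sketch of the Stage 1 output format, the tuples $(a_i, b_i, o_i)$ could be held in a small record type like the one below. The class name `BeliefProposition`, its field names, and the example rows (loosely based on the ball-and-box story of Fig. 1) are illustrative assumptions, not the benchmark's released data format. The check that order 0 and the `world` actor always occur together is one reading of "0 reserved for world-level facts".

```python
from dataclasses import dataclass

WORLD = "world"  # special actor marking narrated facts


@dataclass(frozen=True)
class BeliefProposition:
    actor: str        # a_i: a character, a group, or the special actor "world"
    proposition: str  # b_i: minimal content being represented
    order: int        # o_i in {0, 1, 2, 3}

    def __post_init__(self):
        if self.order not in (0, 1, 2, 3):
            raise ValueError(f"order must be in 0..3, got {self.order}")
        # Assumed reading: order 0 is used exactly for world-level facts.
        if (self.actor == WORLD) != (self.order == 0):
            raise ValueError("order 0 and the 'world' actor must occur together")


# Hypothetical extraction for the ball-and-box story (not gold annotations).
belief_structure = [
    BeliefProposition(WORLD, "The ball is in the basket", 0),
    BeliefProposition("Alice", "The ball is in the basket", 1),
    BeliefProposition("Bob", "The ball is in the box", 1),
    BeliefProposition("Alice", "Bob believes the ball is in the box", 2),
]

for b in belief_structure:
    print(b.order, b.actor, "->", b.proposition)
```

Keeping the list ordered preserves the implicit temporal progression mentioned above; a frozen dataclass is used here only so that propositions can be compared and deduplicated safely.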
Figure 2: OmniToM two-stage functional workflow. Stage 1 (Belief Extraction) maps the story $\mathcal{S}$ to extracted world-fact and actor-belief propositions, $f_{ext}(\mathcal{S}) \rightarrow \mathbf{B}$, where $\mathbf{B}$ contains actor $a_i$, belief proposition $b_i$, and order $o_i$ tuples and uses the special actor `world` for narrated facts. Stage 2 (Belief Labeling) takes the story and extracted propositions, $f_{label}(\mathcal{S}, \mathbf{B}_{ab}) \rightarrow \mathbf{V}$, and outputs one seven-dimensional schema vector per proposition, $\mathbf{s}_i=(o_i,t_i,k_i,r_i,c_i,m_i,x_i)$, corresponding to Order ($o_i$), Truth Status ($t_i$), Knowledge Access ($k_i$), Representation ($r_i$), Content Type ($c_i$), Mental Source ($m_i$), and Context ($x_i$). The visualization shows condensed snapshots from a canonical false-belief example; worked examples appear in App. [E](https://arxiv.org/html/2605.26322#A5).
#### Stage 2: Belief Labeling\.
Given the story and extracted belief propositions, Stage 2 labels each proposition with a seven-dimensional schema vector, $f_{label}(\mathcal{S}, \mathbf{B}_{ab}) \rightarrow \mathbf{V}$, where $\mathbf{s}_i=(o_i,t_i,k_i,r_i,c_i,m_i,x_i)$. The schema is grounded in ATOMS (*Abilities in Theory of Mind Space*) \[Beaudoin et al., [2020](https://arxiv.org/html/2605.26322#bib.bib14)\], which organizes ToM measures by the abilities they test. OmniToM adapts this task-level ability space into a belief-level schema for fine-grained analysis: each extracted proposition is labeled by the representational properties needed to model the underlying mental state, rather than only by the endpoint task category. For example, a false-belief task is not represented by a single “false belief” label; it is decomposed into belief-level properties such as who holds the belief, whether it conflicts with reality, who had access to the relevant information, whether the belief is explicit or inferred, and whether the belief is temporally outdated. Intuitively, the schema asks seven questions about each proposition: how deeply nested it is, whether it is true, who could know it, whether it is explicit or inferred, what it is about, how it was acquired, and what special framing applies. Additional schema-label details are provided in App. [A](https://arxiv.org/html/2605.26322#A1). The schema consists of seven dimensions:
- Order ($o_i$; $\{0,1,2,3\}$) captures the recursive depth of belief attribution: a first-order belief is “Bob believes $X$,” whereas a second-order belief, such as “Bob believes Alice thinks $X$,” reasons about one actor’s belief about another’s belief. Beyond order 3, calibration annotations found insufficient reliable narrative evidence for further nesting.
- Truth Status ($t_i$; {True, False, Unknown}) separates belief attribution from factual correctness, which is central to false-belief and appearance–reality phenomena.
- Knowledge Access ($k_i$; {Private, Shared, Public}) encodes how information is distributed across actors, enabling explicit analysis of ignorance, asymmetry, and deception.
- Representation ($r_i$; {Explicit, Implicit}) distinguishes directly stated beliefs from pragmatically inferred beliefs, helping isolate failures in pragmatic inference.
- Content Type ($c_i$; {Location, Contents/Physical State, Identity/Relation, Epistemic, Desire/Intention, Emotion, Trait/Value, Action/Event}) identifies what is believed, allowing errors to be localized by semantic type (e.g., location tracking versus intention inference).
- Mental Source ($m_i$; {Narration, Perception, Memory, Testimony, Inference, Imagination, Unknown}) tracks how a belief is acquired, supporting source-sensitive analysis.
- Context ($x_i$; {Deceptive, Temporal, Counterfactual, Neutral}) captures story framing conditions that modulate interpretation and belief updating.
Together, these dimensions provide a compact yet expressive representation of belief structure: they preserve broad cognitive coverage while making model behavior auditable in terms of reasoning depth, access structure, semantic content, acquisition source, and contextual framing \[Flavell et al., [1986](https://arxiv.org/html/2605.26322#bib.bib2), Baron-Cohen et al., [1999](https://arxiv.org/html/2605.26322#bib.bib5), Leslie, [1987](https://arxiv.org/html/2605.26322#bib.bib7), Chen et al., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]. In this way, Stage 1 extracts the belief structure itself, while Stage 2 labels that structure along interpretable dimensions for finer-grained ToM analysis.
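The seven label inventories listed above can be written down as a closed-set schema, which makes it easy to check that a predicted label vector is well-formed before scoring. The sketch below is illustrative only: the dictionary keys, the function name `validate_label_vector`, and the example labels for Bob's belief are assumptions and are not taken from the benchmark's gold annotations.

```python
# Label inventories copied from the schema description; key names are assumed.
SCHEMA = {
    "order": {0, 1, 2, 3},
    "truth_status": {"True", "False", "Unknown"},
    "knowledge_access": {"Private", "Shared", "Public"},
    "representation": {"Explicit", "Implicit"},
    "content_type": {
        "Location", "Contents/Physical State", "Identity/Relation", "Epistemic",
        "Desire/Intention", "Emotion", "Trait/Value", "Action/Event",
    },
    "mental_source": {
        "Narration", "Perception", "Memory", "Testimony",
        "Inference", "Imagination", "Unknown",
    },
    "context": {"Deceptive", "Temporal", "Counterfactual", "Neutral"},
}


def validate_label_vector(labels: dict) -> list:
    """Return a list of problems; an empty list means the vector is well-formed."""
    problems = []
    for dim, allowed in SCHEMA.items():
        if dim not in labels:
            problems.append(f"missing dimension: {dim}")
        elif labels[dim] not in allowed:
            problems.append(f"invalid {dim}: {labels[dim]!r}")
    for dim in labels:
        if dim not in SCHEMA:
            problems.append(f"unexpected dimension: {dim}")
    return problems


# Hypothetical labels for "Bob believes the ball is in the box" (illustrative).
bob_belief = {
    "order": 1,
    "truth_status": "False",
    "knowledge_access": "Private",
    "representation": "Implicit",
    "content_type": "Location",
    "mental_source": "Perception",
    "context": "Temporal",
}

print(validate_label_vector(bob_belief))  # []
```

A validator of this kind only checks that labels come from the closed sets; whether a label is correct is a separate question handled by the Stage 2 accuracy metric.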
## 4 Benchmark Construction
Figure 3: OmniToM human-calibrated benchmark-construction pipeline. Stories from seven ToMBench categories are split into a 21-story calibration subset and an 895-story benchmark subset. The calibration subset provides human-annotated gold structures for the two-stage setup: Stage 1 scores belief extraction with GPT-5 semantic-judge $F_1$, and Stage 2 scores belief labeling with labeling accuracy. These calibration scores are used to select Claude-Sonnet-4.5 as the benchmark annotation model, which is then applied to the benchmark subset to produce the final OmniToM benchmark with 895 stories and 22,343 labeled belief propositions.

#### Source Data and Calibration Pipeline.
Figure 4: OmniToM label distribution statistics. Label frequencies over 156,401 labels.

OmniToM sources story text from ToMBench \[Chen et al., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\], a multi-task Theory-of-Mind benchmark of short social-reasoning stories. To support explicit belief-structure modeling from text alone, we retain seven high-volume categories with sufficiently self-contained mental-state evidence: Ambiguous Story Task, False Belief Task, Faux-pas Recognition Test, Hinting Task Test, Persuasion Story Task, Scalar Implicature Test, and Strange Story Task. These categories account for 916 source stories; we reserve 21 for calibration (three per category) and use the remaining 895 as the benchmark split. Constructing OmniToM requires identifying actor-specific belief propositions and assigning seven schema labels per proposition, making fully manual annotation costly and cognitively demanding at scale. We therefore use a human-calibrated, LLM-assisted pipeline that preserves human judgments while enabling benchmark-scale annotation. The gold calibration structures were developed through a multi-month, multi-team effort exceeding 1K person-hours across schema refinement, belief extraction, schema labeling, and validation.
#### Task\-Conditioned LLM Annotations\.
We use TELeR\[Karmaker Santu and Feng,[2023](https://arxiv.org/html/2605.26322#bib.bib9)\], a taxonomy for systematic prompt design, to specify structured prompts for extraction, labeling, and evaluation; the taxonomy and our prompt\-level calibration are detailed in App\.[C](https://arxiv.org/html/2605.26322#A3)\. Benchmark construction uses TELeR Level 4 prompts, which achieved the strongest calibration performance in our prompt\-level sweep\. These prompts combine a task directive, explicit sub\-task decomposition, output criteria, and in\-context examples, providing the structure needed for reliable table\-form annotations\. Stage 1 extraction uses category\-conditioned prompts because the relevant belief dynamics differ across story types\. For example, false\-belief stories emphasize outdated beliefs and information access, hinting and strange stories emphasize pragmatic inference, and persuasion stories emphasize goals and influence\. Stage 2 uses schema\-invariant prompts because the seven\-dimensional labeling schema is shared across all categories and is intended to provide a unified belief\-level representation\.
#### Calibration and Model Selection\.
We evaluate candidate benchmark annotation models on the metrics defined in Sec. [5](https://arxiv.org/html/2605.26322#S5). Stage 1 $F_1$ measures belief extraction, requiring semantic alignment between predicted and gold belief propositions. For this alignment step, we use an LLM-as-a-judge selected by agreement with human semantic-alignment decisions on 126 human-evaluated prediction tables sampled evenly from Qwen3 32B \[Yang et al., [2025](https://arxiv.org/html/2605.26322#bib.bib27)\], Gemma-3 27B \[Gemma Team, [2025](https://arxiv.org/html/2605.26322#bib.bib22)\], and Mistral-Large 123B \[Mistral AI, [2024](https://arxiv.org/html/2605.26322#bib.bib25)\] extraction outputs. Among GPT-5 \[OpenAI, [2025](https://arxiv.org/html/2605.26322#bib.bib20)\], Gemini-2.5 Flash \[Google, [2025](https://arxiv.org/html/2605.26322#bib.bib21)\], Claude-Sonnet-4.5 \[Anthropic, [2025](https://arxiv.org/html/2605.26322#bib.bib19)\], Llama-3.3 70B \[Meta, [2024b](https://arxiv.org/html/2605.26322#bib.bib24)\], and DeepSeek-R1-Distill-Qwen 32B \[DeepSeek-AI, [2025](https://arxiv.org/html/2605.26322#bib.bib28)\], GPT-5 achieves the highest agreement with human semantic judgments (72.03%) and is used as the independent judge for Stage 1 evaluation. Stage 2 accuracy measures closed-set schema labeling by exact match against gold labels. Using Stage 1 $F_1$ and Stage 2 accuracy, we compare Claude-Sonnet-4.5, Mistral-Large 123B, Qwen3 32B, Llama-3.3 70B, Gemini-2.5 Flash, Gemma-3 27B, and GPT-5 as candidate benchmark annotation models. Claude-Sonnet-4.5 achieves the strongest calibration performance across both stages, with 72.88 Stage 1 $F_1$ and 93.62 Stage 2 accuracy, and is therefore used to generate the final benchmark.
#### Final Benchmark Statistics\.
Applying this calibrated pipeline to the 895-story benchmark split yields 22,343 labeled belief propositions and 156,401 schema labels; Fig. [4](https://arxiv.org/html/2605.26322#S4.F4) summarizes their distribution across the seven schema dimensions. The belief-order distribution is Order 0: 32.6%, Order 1: 57.1%, Order 2: 9.8%, and Order 3: 0.5%. Additional details on source filtering, human agreement, semantic-judge calibration, model selection, and benchmark statistics are reported in App. [B](https://arxiv.org/html/2605.26322#A2).
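The reported counts are mutually consistent, as the short check below shows. It uses only figures quoted in this section; the variable names are illustrative, and the mean number of propositions per story is a derived quantity that the text does not report directly.

```python
# Figures quoted in the text.
source_stories = 916
calibration_stories = 21
categories = 7
propositions = 22_343
schema_dimensions = 7
order_share_pct = {0: 32.6, 1: 57.1, 2: 9.8, 3: 0.5}

benchmark_stories = source_stories - calibration_stories
calibration_per_category = calibration_stories // categories
schema_labels = propositions * schema_dimensions
mean_props_per_story = propositions / benchmark_stories  # derived, not reported

print(benchmark_stories)                          # 895
print(calibration_per_category)                   # 3
print(schema_labels)                              # 156401
print(round(mean_props_per_story, 1))             # 25.0
print(round(sum(order_share_pct.values()), 1))    # 100.0
```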
## 5 Evaluation Protocol
Table 2: Example of Stage 1 semantic alignment for belief extraction. MatchCount indicates how each predicted belief aligns with the gold belief set, and vice versa: 0 denotes no semantic match, 1 denotes a one-to-one match, and 2 denotes a compound match. The $mc_R=2$ example shows a gold belief whose content is covered by two predicted beliefs.

Story: Xiao Hong wants to change to a bigger office, but that office is occupied by her colleague Xiao Li.

Predicted Table $\hat{\mathbf{B}}$

| Actor | Belief Proposition | $mc_P$ |
| --- | --- | --- |
| world | Xiao Hong wants to change to a bigger office | 1 |
| world | The bigger office is occupied by Xiao Li | 1 |
| Xiao Hong | Xiao Hong wants to change to a bigger office | 1 |
| Xiao Hong | The bigger office is occupied by Xiao Li | 1 |
| Xiao Hong | Xiao Hong needs to persuade Xiao Li to vacate the office | 1 |
| Xiao Hong | Xiao Li might vacate the office if persuaded | 1 |

Gold Table $\mathbf{B}^{*}$

| Actor | Belief Proposition | $mc_R$ |
| --- | --- | --- |
| world | Xiao Hong wants to change to a bigger office | 1 |
| world | The bigger office is occupied by Xiao Li | 1 |
| world | Xiao Hong and Xiao Li are colleagues | 0 |
| Xiao Hong | Xiao Hong needs the bigger office that Xiao Li occupies | 2 |
| Xiao Hong | Xiao Hong must persuade Xiao Li to leave the bigger office | 1 |
| Xiao Hong | Xiao Li may swap offices if Xiao Hong persuades him | 1 |
OmniToM uses two evaluation protocols aligned with the two-stage benchmark formulation in Sec. [3](https://arxiv.org/html/2605.26322#S3). Stage 1 measures *belief-extraction completeness*: whether a model extracts the belief propositions relevant to the story’s social dynamics. Stage 2 measures *belief-labeling accuracy*: whether provided belief propositions are assigned the correct seven-dimensional schema labels. In both stages, scores are computed per story and macro-averaged across the corpus.
#### Stage 1 Evaluation: Belief-Extraction Completeness ($F_1$).
Given predicted beliefs𝐁^\\hat\{\\mathbf\{B\}\}and gold beliefs𝐁∗\\mathbf\{B\}^\{\*\}, Stage 1 evaluates how well the predicted beliefs recover the gold belief structure\. Exact string matching is too strict because equivalent beliefs can differ in wording or granularity: a compound predicted belief may cover multiple atomic gold beliefs, or several predicted beliefs may together express the content of one gold belief\. We therefore useMatchCountto align predicted beliefs to gold beliefs\. For a predicted beliefb∈𝐁^b\\in\\hat\{\\mathbf\{B\}\},mcP\(b\)mc\_\{P\}\(b\)denotes how many gold beliefs it semantically matches for the same actor:
mcP\(b\)=\{0,ifbmatches no gold belief,1,ifbmatches one gold belief,2or3,ifbis compound and matches two or three distinct gold beliefs\.mc\_\{P\}\(b\)=\\begin\{cases\}0,&\\text\{if \}b\\text\{ matches no gold belief,\}\\\\ 1,&\\text\{if \}b\\text\{ matches one gold belief,\}\\\\ 2\\text\{ or \}3,&\\text\{if \}b\\text\{ is compound and matches two or three distinct gold beliefs\.\}\\end\{cases\}One\-to\-many matches are used conservatively: counts of 2 or 3 are assigned only for compound beliefs that map to multiple distinct gold beliefs, and human calibration found no cases requiring counts above 3 \(App\.[B\.4](https://arxiv.org/html/2605.26322#A2.SS4)\)\. We apply the same procedure in reverse to computemcR\(b\)mc\_\{R\}\(b\)for each gold beliefb∈𝐁∗b\\in\\mathbf\{B\}^\{\*\}\. Thus,mcPmc\_\{P\}measures how predicted beliefs cover the gold set, whilemcRmc\_\{R\}measures how gold beliefs are recovered by the predictions\. As illustrated in Table[2](https://arxiv.org/html/2605.26322#S5.T2), bidirectional alignment provides the matched belief counts used for precision and recall while preventing duplicate or paraphrased beliefs from receiving repeated credit\.
For each story $s$, let $M_P^{(s)}$ and $M_R^{(s)}$ denote the number of matched predicted and gold beliefs, respectively, i\.e\., beliefs with nonzero MatchCount\. Let $|\hat{\mathbf{B}}^{(s)}|$ and $|\mathbf{B}^{*(s)}|$ denote the number of predicted and gold beliefs for story $s$\. We compute story\-level precision, recall, and $F_1$, and then macro\-average across the $S$ evaluated stories:

$$P^{(s)}=\frac{M_P^{(s)}}{|\hat{\mathbf{B}}^{(s)}|},\qquad R^{(s)}=\frac{M_R^{(s)}}{|\mathbf{B}^{*(s)}|},\qquad F_1^{(s)}=\frac{2P^{(s)}R^{(s)}}{P^{(s)}+R^{(s)}},\qquad F_{1,\mathrm{macro}}=\frac{1}{S}\sum_{s=1}^{S}F_1^{(s)}.$$
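The Stage 1 scoring above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `StoryMatches` container and the function names are hypothetical, the MatchCount values are assumed to have already been produced by the semantic judge, and the zero\-division conventions (returning 0 when a story has no predicted beliefs or no matches) are our assumption, since the text does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StoryMatches:
    """MatchCount values for one story (hypothetical container)."""
    mc_pred: List[int]  # mc_P(b) for each predicted belief, each in {0, 1, 2, 3}
    mc_gold: List[int]  # mc_R(b) for each gold belief, each in {0, 1, 2, 3}


def story_prf(story: StoryMatches) -> Tuple[float, float, float]:
    """Story-level precision, recall, and F1 from MatchCount values."""
    # M_P and M_R count beliefs with a nonzero MatchCount, so a compound
    # belief with count 2 or 3 still contributes a single matched belief.
    m_p = sum(1 for c in story.mc_pred if c > 0)
    m_r = sum(1 for c in story.mc_gold if c > 0)
    # Assumed convention: an empty belief list yields a score of 0.
    precision = m_p / len(story.mc_pred) if story.mc_pred else 0.0
    recall = m_r / len(story.mc_gold) if story.mc_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def macro_f1(stories: List[StoryMatches]) -> float:
    """Unweighted mean of story-level F1 over the evaluated stories."""
    if not stories:
        raise ValueError("at least one story is required")
    return sum(story_prf(s)[2] for s in stories) / len(stories)
```

For a story with four predicted beliefs with counts `[2, 1, 0, 1]` and five gold beliefs with counts `[1, 1, 1, 0, 1]`, this gives $P = 3/4$, $R = 4/5$, and $F_1 \approx 0.774$. Because the average is taken over stories rather than over pooled beliefs, a short story weighs as much as a long one.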
#### Stage 2 Evaluation: Belief\-Labeling Accuracy\.
Given predicted schema labels $\hat{y}^{(s)}_{i,d}$ and gold labels $y^{*(s)}_{i,d}$, Stage 2 evaluates exact\-match accuracy over the gold belief propositions\. Here, $s$ indexes the story, $i$ indexes a gold belief within that story, $d$ indexes one of the seven schema dimensions, and $N^{(s)}_{\mathrm{gold}}$ is the number of gold beliefs in story $s$\. A prediction is counted as correct when $\hat{y}^{(s)}_{i,d}=y^{*(s)}_{i,d}$\. We compute per\-dimension accuracy and overall story\-level labeling accuracy as:

$$\mathrm{Acc}^{(s)}_{d}=\frac{1}{N^{(s)}_{\mathrm{gold}}}\sum_{i=1}^{N^{(s)}_{\mathrm{gold}}}\mathbf{1}\!\left[\hat{y}^{(s)}_{i,d}=y^{*(s)}_{i,d}\right],\qquad\mathrm{Acc}^{(s)}_{\mathrm{overall}}=\frac{1}{7}\sum_{d=1}^{7}\mathrm{Acc}^{(s)}_{d}.$$

We report both per\-dimension and overall Stage 2 accuracy by macro\-averaging these story\-level scores across stories\.
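The Stage 2 accuracy definitions can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the released evaluation code: the dimension keys in `DIMENSIONS`, the dictionary record format, and the function names are hypothetical, and predictions are assumed to be aligned one\-to\-one with the gold beliefs of each story.

```python
from typing import Any, Dict, List

# Hypothetical keys for the seven schema dimensions.
DIMENSIONS = [
    "order", "truth_status", "knowledge_access", "representation",
    "content_type", "mental_source", "context",
]

Labels = Dict[str, Any]  # one label per dimension for a single gold belief


def story_accuracy(pred: List[Labels], gold: List[Labels]) -> Dict[str, float]:
    """Per-dimension and overall exact-match accuracy for one story."""
    if not gold or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and aligned one-to-one")
    n_gold = len(gold)
    acc = {
        # A missing predicted label never equals the gold label, so it counts as wrong.
        d: sum(p.get(d) == g[d] for p, g in zip(pred, gold)) / n_gold
        for d in DIMENSIONS
    }
    # Overall story accuracy is the unweighted mean over the seven dimensions.
    acc["overall"] = sum(acc[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return acc


def macro_accuracy(per_story: List[Dict[str, float]]) -> Dict[str, float]:
    """Macro-average story-level scores across stories."""
    if not per_story:
        raise ValueError("at least one story is required")
    keys = DIMENSIONS + ["overall"]
    return {k: sum(s[k] for s in per_story) / len(per_story) for k in keys}
```

Since both averaging steps are linear, the reported overall macro accuracy equals the mean of the seven per\-dimension macro accuracies. The Gemini\-2\.5 Flash row of Table 3 is consistent with this: $(95.56+84.97+71.34+87.58+85.97+84.10+92.14)/7 \approx 85.95$.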
## 6Experiments
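Table 3: Main OmniToM benchmark results under zero\-shot TELeR Level 3 prompts \(%\)\. Stage 1 reports category\-wise and overall macro $F_1$; Stage 2 reports per\-dimension and overall macro belief\-labeling accuracy\. GPT\-5 is omitted from Stage 1 because it serves as the semantic judge\. Best is bold, and second\-best is underlined\.

Stage 1: Belief Extraction $F_1$

| Model | Params | AST | FBT | FPT | HT | PST | SIT | SST | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Closed-source models* | | | | | | | | | |
| Gemini-2.5 Flash | N/A | 42.40 | 56.48 | 57.78 | 50.34 | 58.55 | 62.91 | 56.31 | 54.97 |
| GPT-5 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| *Open-source models* | | | | | | | | | |
| Gemma-3 27B | 27B | 48.72 | 72.39 | 56.05 | 45.46 | 56.72 | 68.76 | 55.77 | 57.69 |
| Llama-3.1 8B | 8B | 26.34 | 48.29 | 35.80 | 31.48 | 36.12 | 53.52 | 30.37 | 37.42 |
| Llama-3.3 70B | 70B | 37.51 | 64.07 | 46.33 | 36.27 | 47.23 | 57.70 | 41.58 | 47.24 |
| Mistral-Small 24B | 24B | 52.97 | 54.58 | 59.79 | 48.32 | 56.97 | 66.20 | 53.17 | 56.00 |
| Mistral-Large 123B | 123B | 47.75 | 71.28 | 53.66 | 41.78 | 58.53 | 57.38 | 48.35 | 54.10 |
| Qwen3 8B | 8B | 39.12 | 50.22 | 44.21 | 37.14 | 48.36 | 47.20 | 37.60 | 43.41 |
| Qwen3 32B | 32B | 46.88 | 57.32 | 53.38 | 41.41 | 57.25 | 56.51 | 48.67 | 51.63 |

Stage 2: Belief\-Labeling Accuracy

| Model | Params | Order | Status | Access | Repr | CType | Source | Context | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Closed-source models* | | | | | | | | | |
| Gemini-2.5 Flash | N/A | 95.56 | 84.97 | 71.34 | 87.58 | 85.97 | 84.10 | 92.14 | 85.95 |
| GPT-5 | N/A | 95.18 | 82.72 | 66.85 | 83.42 | 79.96 | 83.02 | 88.83 | 82.85 |
| *Open-source models* | | | | | | | | | |
| Gemma-3 27B | 27B | 96.56 | 82.44 | 71.57 | 54.33 | 73.50 | 78.72 | 92.07 | 78.46 |
| Llama-3.1 8B | 8B | 71.90 | 65.59 | 56.13 | 64.40 | 48.63 | 55.18 | 76.81 | 62.66 |
| Llama-3.3 70B | 70B | 92.74 | 83.55 | 67.41 | 72.43 | 72.35 | 76.71 | 91.69 | 79.55 |
| Mistral-Small 24B | 24B | 95.13 | 82.22 | 74.59 | 62.79 | 76.01 | 84.82 | 91.90 | 81.06 |
| Mistral-Large 123B | 123B | 97.25 | 86.53 | 74.14 | 72.87 | 82.83 | 86.32 | 92.97 | 84.70 |
| Qwen3 8B | 8B | 73.38 | 67.17 | 57.94 | 63.77 | 51.43 | 61.49 | 74.62 | 64.26 |
| Qwen3 32B | 32B | 96.42 | 82.43 | 73.91 | 62.45 | 71.27 | 76.84 | 90.81 | 79.16 |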
#### Experimental Setup and Baselines\.
We evaluate models under zero\-shot TELeR Level 3 prompts \[Karmaker Santu and Feng, [2023](https://arxiv.org/html/2605.26322#bib.bib9)\]\. Level 3 provides a task directive and stepwise sub\-tasks, but no in\-context examples or category\-specific evaluation criteria\. This setting gives models the task definition while avoiding the stronger construction\-time scaffolding used in Level 4 prompts, allowing OmniToM to measure how well models can extract belief structures and label belief propositions from instructions alone\. We benchmark nine models in total\. The API baselines are gemini\-2\.5\-flash \[Google, [2025](https://arxiv.org/html/2605.26322#bib.bib21)\] and GPT\-5 \[OpenAI, [2025](https://arxiv.org/html/2605.26322#bib.bib20)\]\. The open\-weight baselines are gemma\-3\-27b\-it \[Gemma Team, [2025](https://arxiv.org/html/2605.26322#bib.bib22)\], Llama\-3\.1\-8B\-Instruct \[Meta, [2024a](https://arxiv.org/html/2605.26322#bib.bib23)\], Llama\-3\.3\-70B\-Instruct \[Meta, [2024b](https://arxiv.org/html/2605.26322#bib.bib24)\], Mistral\-Small\-24B\-Instruct\-2501 \[Mistral AI, [2025](https://arxiv.org/html/2605.26322#bib.bib26)\], Mistral\-Large\-Instruct\-2407 \[Mistral AI, [2024](https://arxiv.org/html/2605.26322#bib.bib25)\], Qwen3\-8B, and Qwen3\-32B \[Yang et al\., [2025](https://arxiv.org/html/2605.26322#bib.bib27)\]\. Table [3](https://arxiv.org/html/2605.26322#S6.T3) uses shortened display labels for space\. Open\-weight evaluations were run on NVIDIA A100 80GB GPUs, with 4\-bit quantization where needed, and took approximately 48 GPU\-hours per open\-weight model\. GPT\-5 is excluded from Stage 1 model evaluation since it serves as the semantic judge\.
#### Main Results Summary\.
Results in Table [3](https://arxiv.org/html/2605.26322#S6.T3) and Fig\. [5](https://arxiv.org/html/2605.26322#S6.F5) identify an actor\-specific information\-tracking bottleneck\. The strongest Stage 1 model \(Gemma\-3 27B\) reaches 57\.69 macro $F_1$, and the best overall Stage 2 accuracy is 85\.95% \(Gemini\-2\.5 Flash\)\. Together, the two stages show that errors emerge when models must map story facts onto actors’ information states: extraction drops for actor beliefs, and labeling errors concentrate on Knowledge Access and Representation, the dimensions that specify who could know or share a belief and whether it is stated or inferred\.
#### Stage 1: Belief Extraction\.
Stage 1 localizes the structural side of this bottleneck\. Fig\.[5](https://arxiv.org/html/2605.26322#S6.F5)a shows a consistent drop from Order 0 world facts to Order 1 and Order 2\+ actor beliefs\. Moving beyond Order 0 requires more than decomposing the story into facts: the model must determine which facts each actor perceived, missed, remembered, was told, or could infer\. Higher\-order beliefs add another layer, requiring the model to represent one actor’s view of another actor’s information state\. Thus, Stage 1 failures reflect difficulty converting story information into actor\-indexed belief states\.


Figure 5: Order\-wise analysis\. \(a\) Stage 1 extraction $F_1$ by model and belief\-order bucket\. \(b\) Stage 2 labeling accuracy by belief\-order bucket and schema dimension, showing that Order 1 beliefs are hardest to label, especially for Knowledge Access and Representation\.
#### Stage 2: Belief Labeling\.
Stage 2 explains this bottleneck at the schema\-label level\. As shown in Table [3](https://arxiv.org/html/2605.26322#S6.T3), models are weakest on Knowledge Access \(56\.13–74\.59%\) and Representation \(54\.33–87\.58%\), the two dimensions most directly tied to information distribution\. Fig\. [5](https://arxiv.org/html/2605.26322#S6.F5)b shows the same pattern by belief order: Order 1 actor beliefs have the lowest overall labeling accuracy \(71\.2%\), with especially low accuracy for Knowledge Access \(58\.9%\) and Representation \(57\.3%\)\. These labels require deciding who could know or share a belief, and whether the belief is directly stated or inferred from perception, testimony, interaction, or context\. Order 2\+ labels partially recover overall \(74\.4%\) because some supplied higher\-order propositions are more schema\-constrained; for example, their Content Type is often epistemic by construction\. Thus, Stage 2 suggests that the main weakness is not labeling belief content in general, but tracking the information conditions under which actor\-specific beliefs are formed and shared\. Additional diagnostics are reported in App\. [D](https://arxiv.org/html/2605.26322#A4)\.
## 7Conclusion and Limitations
#### Conclusion\.
We introduced OmniToM, a benchmark that evaluates Theory of Mind through explicit belief\-structure modeling rather than endpoint question answering\. By requiring models to extract and label multi\-actor belief structures, OmniToM makes the mental\-state representations behind social reasoning directly inspectable\. Across 895 stories and 22,343 labeled belief propositions, our results identify an actor\-specific information\-tracking bottleneck: current LLMs can often recover story facts, but struggle to determine how those facts are distributed across actors, communicated or inferred, and transformed into beliefs\. This bottleneck appears in both stages: extraction drops when models must recover actor beliefs, and labeling errors concentrate on Knowledge Access and Representation\. OmniToM exposes a limitation that endpoint QA can hide: robust ToM reasoning requires tracking the information conditions that give rise to mental\-state representations\.
#### Limitations\.
OmniToM evaluates story\-based, text\-only ToM reasoning over seven retained ToMBench categories\. Its short, self\-contained narratives support controlled belief\-structure evaluation but do not cover multimodal or interactive reasoning, long\-horizon information tracking, dense temporal structure, or belief nesting beyond order 3, and the benchmark inherits ToMBench’s topical and representational limits\. The seven\-dimensional schema is human\-labeled and retains some interpretive subjectivity, which we mitigate through a closed label space, three\-annotator verification, majority voting, and expert adjudication where needed\. Stage 1 scoring uses a human\-calibrated semantic judge rather than exhaustive human adjudication; its 72\.03% human\-alignment agreement is moderate, so Stage 1 $F_1$ should be treated as an approximate aggregate measure, especially for close model comparisons and implicit or differently granular beliefs\. Future work should expand to richer narratives, larger human\-audited extraction sets, multi\-judge checks, and uncertainty estimates around extraction scores\.
## References
- System Card: Claude Sonnet 4\.5\.Technical reportAnthropic\.External Links:[Link](https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3)\.
- S\. Baron\-Cohen, M\. O’Riordan, V\. Stone, R\. Jones, and K\. Plaisted \(1999\)Recognition of faux pas by normally developing children and children with asperger syndrome or high\-functioning autism\.Journal of Autism and Developmental Disorders29\(5\),pp\. 407–418\.External Links:[Document](https://dx.doi.org/10.1023/A%3A1023035012436)Cited by:[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px3.p2.1)\.
- C\. Beaudoin, É\. Leblanc, C\. Gagner, and M\. H\. Beauchamp \(2020\)Systematic review and inventory of theory of mind measures for young children\.Frontiers in Psychology10,pp\. 2905\.External Links:[Document](https://dx.doi.org/10.3389/fpsyg.2019.02905)Cited by:[§A\.1](https://arxiv.org/html/2605.26322#A1.SS1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§1](https://arxiv.org/html/2605.26322#S1.p3.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px2.p1.1),[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px3.p1.2)\.
- Z\. Chen, J\. Wu, J\. Zhou, B\. Wen, G\. Bi, G\. Jiang, Y\. Cao, M\. Hu, Y\. Lai, Z\. Xiong, and M\. Huang \(2024\)ToMBench: benchmarking theory of mind in large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15959–15983\.External Links:[Link](https://aclanthology.org/2024.acl-long.847/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.847)Cited by:[§A\.1](https://arxiv.org/html/2605.26322#A1.SS1.p3.1),[§B\.6](https://arxiv.org/html/2605.26322#A2.SS6.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.26322#S2.T1.3.1.1.1.1.1.1.1.1.6.5.1),[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px3.p2.1),[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px1.p1.1)\.
- DeepSeek\-AI \(2025\)DeepSeek\-R1\-Distill\-Qwen\-32B\.Note:Hugging Face Model CardExternal Links:[Link](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3)\.
- J\. H\. Flavell, F\. L\. Green, and E\. R\. Flavell \(1986\)Development of knowledge about the appearance–reality distinction\.Monographs of the Society for Research in Child Development51\(1\),pp\. 1–87\.Note:Serial No\. 212External Links:[Document](https://dx.doi.org/10.2307/1165866)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px3.p2.1)\.
- Gemma Team \(2025\)Gemma 3 Technical Report\.Note:arXiv:2503\.19786External Links:2503\.19786,[Document](https://dx.doi.org/10.48550/arXiv.2503.19786),[Link](https://arxiv.org/abs/2503.19786)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- N\. D\. Goodman and A\. Stuhlmüller \(2013\)Knowledge and implicature: modeling language understanding as social cognition\.Topics in Cognitive Science5\(1\),pp\. 173–184\.External Links:[Document](https://dx.doi.org/10.1111/tops.12007)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p2.1)\.
- Google \(2025\)Gemini 2\.5 Flash\.Note:Google AI for Developers DocumentationExternal Links:[Link](https://ai.google.dev/gemini-api/docs/models)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- F\. G\. E\. Happé \(1994\)An advanced test of theory of mind: understanding of story characters’ thoughts and feelings by able autistic, mentally handicapped, and normal children and adults\.Journal of Autism and Developmental Disorders24\(2\),pp\. 129–154\.External Links:[Document](https://dx.doi.org/10.1007/BF02172093)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px2.p1.7)\.
- C\. Jung, D\. Kim, J\. Jin, J\. Kim, Y\. Seonwoo, Y\. Choi, A\. Oh, and H\. Kim \(2024\)Perceptions to beliefs: exploring precursory inferences for theory of mind in large language models\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 19794–19809\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1105/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1105)Cited by:[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1)\.
- S\. K\. Karmaker Santu and D\. Feng \(2023\)TELeR: a general taxonomy of LLM prompts for benchmarking complex tasks\.InFindings of the Association for Computational Linguistics: EMNLP 2023,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 14197–14203\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.946/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.946)Cited by:[§C\.1](https://arxiv.org/html/2605.26322#A3.SS1.p1.1),[§C\.1](https://arxiv.org/html/2605.26322#A3.SS1.p2.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- H\. Kim, M\. Sclar, X\. Zhou, R\. Bras, G\. Kim, Y\. Choi, and M\. Sap \(2023\)FANToM: a benchmark for stress\-testing machine theory of mind in interactions\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp\. 14397–14413\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.890/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.890)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.26322#S2.T1.3.1.1.1.1.1.1.1.1.4.3.1)\.
- M\. Le, Y\. Boureau, and M\. Nickel \(2019\)Revisiting the evaluation of theory of mind through question answering\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),Hong Kong, China,pp\. 5872–5877\.External Links:[Link](https://aclanthology.org/D19-1598/),[Document](https://dx.doi.org/10.18653/v1/D19-1598)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.26322#S2.T1.3.1.1.1.1.1.1.1.1.2.1.1)\.
- A\. M\. Leslie \(1987\)Pretense and representation: the origins of “theory of mind”\.Psychological Review94\(4\),pp\. 412–426\.External Links:[Document](https://dx.doi.org/10.1037/0033-295X.94.4.412)Cited by:[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px3.p2.1)\.
- Meta \(2024a\)Llama\-3\.1\-8B\-Instruct\.Note:Hugging Face Model CardExternal Links:[Link](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)Cited by:[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- Meta \(2024b\)Llama\-3\.3\-70B\-Instruct\.Note:Hugging Face Model CardExternal Links:[Link](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- Mistral AI \(2024\)Mistral Large 2\.0\.Note:Mistral AI Model CardExternal Links:[Link](https://docs.mistral.ai/models/model-cards/mistral-large-2-0-24-07)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- Mistral AI \(2025\)Mistral\-Small\-24B\-Instruct\-2501\.Note:Hugging Face Model CardExternal Links:[Link](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501)Cited by:[§B\.2](https://arxiv.org/html/2605.26322#A2.SS2.p2.1),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2025\)GPT\-5 system card\.Note:OpenAI PublicationExternal Links:[Link](https://openai.com/blog/gpt-5-system-card/)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- J\. Perner and H\. Wimmer \(1985\)“John thinks that Mary thinks that …”: attribution of second\-order beliefs by 5\- to 10\-year\-old children\.Journal of Experimental Child Psychology39\(3\),pp\. 437–471\.External Links:[Document](https://dx.doi.org/10.1016/0022-0965%2885%2990051-7)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px2.p1.7)\.
- Z\. Pi, A\. Vadaparty, B\. K\. Bergen, and C\. R\. Jones \(2024\)Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?\.Note:arXiv:2406\.14737, version 2 revised May 27, 2025External Links:2406\.14737,[Document](https://dx.doi.org/10.48550/arXiv.2406.14737),[Link](https://arxiv.org/abs/2406.14737)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1)\.
- D\. Premack and G\. Woodruff \(1978\)Does the chimpanzee have a theory of mind?\.Behavioral and Brain Sciences1\(4\),pp\. 515–526\.External Links:[Document](https://dx.doi.org/10.1017/S0140525X00076512)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1)\.
- M\. Sclar, S\. Kumar, P\. West, A\. Suhr, Y\. Choi, and Y\. Tsvetkov \(2023\)Minding language models’ \(lack of\) theory of mind: a plug\-and\-play multi\-character belief tracker\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Toronto, Canada,pp\. 13960–13980\.External Links:[Link](https://aclanthology.org/2023.acl-long.780/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.780)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2605.26322#S2.T1.3.1.1.1.1.1.1.1.1.7.6.1)\.
- S\. Tan, S\. Zhuang, K\. Montgomery, W\. Y\. Tang, A\. Cuadron, C\. Wang, R\. A\. Popa, and I\. Stoica \(2025\)JudgeBench: a benchmark for evaluating LLM\-based judges\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Tan, D\. Li, S\. Wang, A\. Beigi, B\. Jiang, A\. Bhattacharjee, M\. Karami, J\. Li, L\. Cheng, and H\. Liu \(2024\)Large language models for data annotation and synthesis: a survey\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp\. 930–957\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.54),[Link](https://aclanthology.org/2024.emnlp-main.54/)Cited by:[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Wimmer and J\. Perner \(1983\)Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception\.Cognition13\(1\),pp\. 103–128\.External Links:[Document](https://dx.doi.org/10.1016/0010-0277%2883%2990004-5)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§3](https://arxiv.org/html/2605.26322#S3.SS0.SSS0.Px2.p1.7)\.
- Y\. Wu, Y\. He, Y\. Jia, R\. Mihalcea, Y\. Chen, and N\. Deng \(2023\)Hi\-ToM: a benchmark for evaluating higher\-order theory of mind reasoning in large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Singapore,pp\. 10691–10706\.External Links:[Link](https://aclanthology.org/2023.findings-emnlp.717/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.717)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.26322#S2.T1.3.1.1.1.1.1.1.1.1.3.2.1)\.
- H\. Xu, R\. Zhao, L\. Zhu, J\. Du, and Y\. He \(2024\)OpenToM: a comprehensive benchmark for evaluating theory\-of\-mind reasoning capabilities of large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 8593–8623\.External Links:[Link](https://aclanthology.org/2024.acl-long.466/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.466)Cited by:[§1](https://arxiv.org/html/2605.26322#S1.p1.1),[§1](https://arxiv.org/html/2605.26322#S1.p2.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px2.p1.1),[Table 1](https://arxiv.org/html/2605.26322#S2.T1.3.1.1.1.1.1.1.1.1.5.4.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 Technical Report\.Note:arXiv:2505\.09388External Links:2505\.09388,[Document](https://dx.doi.org/10.48550/arXiv.2505.09388),[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4](https://arxiv.org/html/2605.26322#S4.SS0.SSS0.Px3.p1.3),[§6](https://arxiv.org/html/2605.26322#S6.SS0.SSS0.Px1.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html)Cited by:[§2](https://arxiv.org/html/2605.26322#S2.SS0.SSS0.Px3.p1.1)\.
## Supplementary Material
This supplement provides methodological and analysis details supporting the main paper\. We first discuss broader impacts, then provide technical appendices\. Appendix[A](https://arxiv.org/html/2605.26322#A1)defines the OmniToM schema derivation and labeling rules\. Appendix[B](https://arxiv.org/html/2605.26322#A2)documents source filtering, calibration, annotation reliability, semantic\-judge calibration, and benchmark record format\. Appendix[C](https://arxiv.org/html/2605.26322#A3)records the TELeR prompt protocol used for construction, evaluation, and semantic judging\. Appendix[D](https://arxiv.org/html/2605.26322#A4)reports extended experimental results and output audits\. Appendix[E](https://arxiv.org/html/2605.26322#A5)provides worked annotation examples for each retained story category\.
## Broader Impacts
OmniToM supports more process\-sensitive evaluation of social reasoning by shifting attention from endpoint question answering to explicit belief\-structure modeling\. This can help researchers analyze where models fail to track information access, communication, and actor\-specific beliefs, rather than treating a correct final answer as sufficient evidence of ToM competence\. By making intermediate mental\-state representations inspectable, OmniToM may support more transparent evaluation of model behavior in social\-reasoning settings\.
At the same time, OmniToM should not be used to certify real\-world social intelligence, interpersonal reliability, or deployed\-agent safety\. Story\-based benchmark scores do not establish that a model can reason appropriately in open\-ended human interaction, clinical, educational, legal, or other high\-stakes settings\. The benchmark is intended as an evaluation tool for controlled text\-based scenarios, not as evidence that a deployed system possesses human\-like social understanding\.
The benchmark also has dual\-use considerations\. Better tools for evaluating mental\-state tracking may support safer human\-AI interaction and more transparent model analysis, but they could also be used to optimize systems for persuasion, deception, or strategic modeling of users’ beliefs\. We therefore frame OmniToM as an evaluation benchmark rather than a training objective, deployment guarantee, or safety certification, and recommend reporting benchmark scores together with the limitations described in Sec\.[7](https://arxiv.org/html/2605.26322#S7)\.
## Appendix ASchema Reference and Labeling Rules
This appendix explains how OmniToM derives its operational Stage 2 schema from prior task\-level ToM ability taxonomies\. Each belief proposition receives exactly one label from each of the seven OmniToM dimensions\.
### A\.1From ATOMS Abilities to OmniToM Belief\-Level Dimensions
ATOMS \(*Abilities in Theory of Mind Space*\) is a taxonomy of Theory\-of\-Mind measures derived from a systematic review of ToM tasks and assessment instruments \[Beaudoin et al\., [2020](https://arxiv.org/html/2605.26322#bib.bib14)\]\. The review identifies 220 measures across 830 studies and organizes them into seven broad categories: emotions, desires, intentions, percepts, knowledge, beliefs, and mentalistic understanding of non\-literal communication, with 39 finer\-grained sub\-abilities\. We use ATOMS as a coverage scaffold: it identifies the kinds of mental\-state abilities a ToM benchmark should cover, but it does not directly define labels for individual belief propositions\.
OmniToM adapts this coverage to the belief\-proposition level\. A task\-level ability describes what a story or question tests, such as false belief, hidden emotion, or non\-literal communication\. A belief\-level schema instead describes the representational properties needed inside the story: who holds a belief, whether it is true, who could know it, whether it is explicit or inferred, what it is about, how it was acquired, and what contextual framing affects it\.
ToMBench operationalizes this coverage as text\-compatible social\-reasoning stories with multiple\-choice questions, covering eight tasks and 31 abilities \[Chen et al\., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]\. OmniToM uses the retained ToMBench stories as source narratives but shifts evaluation from endpoint answers to explicit belief\-structure modeling\. Rather than assigning one ability label to an item, OmniToM decomposes each story’s representational requirements into labels for relevant world\-fact and actor\-belief propositions\. Fig\. [A\.1](https://arxiv.org/html/2605.26322#A1.SS1) summarizes this derivation\. The relationship between task\-level abilities and OmniToM dimensions is many\-to\-many: one ToM ability may require several belief\-level properties, and one OmniToM dimension may support multiple abilities\.
Figure 6: ATOMS\-to\-OmniToM schema derivation\. ATOMS provides task\-level coverage for ToM abilities, ToMBench instantiates a text\-compatible subset as social\-reasoning story tasks, and OmniToM decomposes the representational requirements of those tasks into seven belief\-level labels assigned to each proposition\.

False\-belief tasks illustrate the mapping\. They require tracking that an actor’s belief conflicts with narrated reality \(Truth Status\), who had access to the relevant update \(Knowledge Access\), whether the belief is stated or inferred \(Representation\), how the belief was formed \(Mental Source\), and whether it is outdated or deceptive \(Context\)\. Other task families motivate the remaining coverage: second\-order tasks require Order and often Epistemic content; non\-literal communication tasks motivate Representation and Context; and desire, intention, and emotion tasks motivate the corresponding Content Type labels\.
Together, these examples clarify the role of ATOMS in OmniToM\. The schema does not relabel ATOMS abilities one\-to\-one; instead, it converts recurring ToM requirements into a unified per\-proposition schema for annotating the belief structure of a story\. Pure perceptual sub\-abilities that require visual stimuli are not directly retained in this text\-only benchmark, although perceptual evidence is represented when it functions as a character’s Mental Source\.
### A\.2Schema Dimensions and Labeling Rules
#### Order\.
Label set: $\{0,1,2,3\}$\. Order captures the recursive depth of the represented belief\. Order 0 is reserved for narrator or world\-level facts; Order 1 denotes an actor’s belief about the world; Order 2 denotes an actor’s belief about another actor’s belief; and Order 3 denotes a further nested belief attribution\. The order cap is based on calibration annotations: across the retained ToMBench categories, annotators found world facts, first\-order beliefs, and occasional second\- or third\-order beliefs, but no reliably grounded order\-4 cases\. Putative deeper nestings were treated as underspecified unless directly supported by the narrative\.
#### Truth Status\.
Label set: \{True, False, Unknown\}\. Truth Status labels whether the proposition is supported, contradicted, or left unresolved by the story evidence\. This label is evaluated relative to the narrative, not relative to the believer’s confidence\. Annotators use Unknown whenever the story does not verify or contradict the proposition; the annotation rule prefers Unknown over speculative completion\.
#### Knowledge Access\.
Label set: \{Private, Shared, Public\}\. Knowledge Access captures how the proposition is distributed across actors\. Private is used for internal or unshared beliefs, even when the underlying event itself is public\. Shared is used when the proposition is mutually available within a subset of actors through communication, joint perception, or obvious mutual awareness\. Public is reserved for common ground across all relevant actors\. This dimension is especially sensitive to whether the proposition is explicitly communicated or only inferred from context\.
#### Representation\.
Label set: \{Explicit, Implicit\}\. Representation distinguishes directly stated beliefs from beliefs inferred from actions, perception, pragmatic cues, or broader context\. Explicit is used when the belief\-relevant content is directly narrated, directly spoken, or directly attributed as a mental state\. Implicit is used when the annotator must infer the belief from the story’s social or pragmatic context\.
#### Content Type\.
Label set: \{Location, Contents/Physical State, Identity/Relation, Epistemic, Desire/Intention, Emotion, Trait/Value, Action/Event\}\. Content Type identifies what the proposition is about\. Location labels where an entity is or was; Contents/Physical State labels what a container holds or the physical condition of an object; Identity/Relation labels actor identities or relationships; Epistemic labels beliefs about knowledge, awareness, or other beliefs; Desire/Intention labels goals, wants, plans, or intended outcomes; Emotion labels affective states; Trait/Value labels preferences, dispositions, or evaluations; and Action/Event labels happenings or actions\.
#### Mental Source\.
Label set: \{Narration, Perception, Memory, Testimony, Inference, Imagination, Unknown\}\. Mental Source tracks how the belief was acquired\. Narration is reserved for world\-level facts\. Perception is used when an actor directly observes the relevant event or state\. Memory is used when the belief depends on a prior perceived state\. Testimony is used when the belief comes from another actor’s utterance\. Inference is used when the belief must be derived from actions, social cues, or context\. Imagination covers hypothetical or imagined content, and Unknown is used when the source is not recoverable from the story\.
#### Context\.
Label set: \{Deceptive, Temporal, Counterfactual, Neutral\}\. Context captures special framing conditions that affect interpretation\. Temporal marks stale, recalled, or past\-state beliefs, regardless of whether the proposition is ultimately true or false\. Deceptive is used for beliefs shaped by lying, concealment, or deliberate misdirection\. Counterfactual is used for hypothetical, pretense, or non\-actual framing\. Neutral is used when none of these special conditions applies\.
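The seven closed label sets can be written down as a small lookup structure, which makes the "exactly one legal label per dimension" requirement mechanical to check. The following is a minimal sketch, not the authors' released code: the names `SCHEMA` and `validate_labels` and the example belief are hypothetical, and the only cross-dimension rule encoded is the one stated above (Narration is reserved for world-level facts, i.e. Order 0).

```python
# Closed label sets transcribed from the seven schema dimensions above.
SCHEMA = {
    "Order": {"0", "1", "2", "3"},
    "Truth Status": {"True", "False", "Unknown"},
    "Knowledge Access": {"Private", "Shared", "Public"},
    "Representation": {"Explicit", "Implicit"},
    "Content Type": {
        "Location", "Contents/Physical State", "Identity/Relation", "Epistemic",
        "Desire/Intention", "Emotion", "Trait/Value", "Action/Event",
    },
    "Mental Source": {
        "Narration", "Perception", "Memory", "Testimony",
        "Inference", "Imagination", "Unknown",
    },
    "Context": {"Deceptive", "Temporal", "Counterfactual", "Neutral"},
}


def validate_labels(labels):
    """Return a list of problems; an empty list means the labels fit the schema."""
    problems = []
    for dimension, legal in SCHEMA.items():
        value = labels.get(dimension)
        if value is None:
            problems.append(f"missing dimension: {dimension}")
        elif value not in legal:
            problems.append(f"illegal {dimension} label: {value!r}")
    # Mental Source rule: Narration is reserved for world-level (Order 0) facts.
    if labels.get("Mental Source") == "Narration" and labels.get("Order") != "0":
        problems.append("Narration requires Order 0")
    return problems


# Hypothetical example: a stale first-order belief about an object's location.
stale_belief = {
    "Order": "1",
    "Truth Status": "False",
    "Knowledge Access": "Private",
    "Representation": "Implicit",
    "Content Type": "Location",
    "Mental Source": "Memory",
    "Context": "Temporal",
}
print(validate_labels(stale_belief))  # []
```

Keeping Order as a string rather than an integer is a design choice of this sketch: it lets all seven dimensions be compared against model output with the same string-membership test.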
## Appendix B Benchmark Construction and Validation
This appendix section consolidates the methodological support for benchmark construction: retained source categories, the human\-calibrated annotation pipeline, calibration\-time model selection, annotation reliability, semantic\-judge calibration, and the benchmark record format\.
### B\.1 Source Filtering, Retained Categories, and Benchmark Statistics
OmniToM begins from the 1,383\-story ToMBench source corpus and retains seven high\-volume categories whose stories provide sufficiently self\-contained mental\-state evidence for explicit belief extraction from story text alone\. These categories account for 916 source stories\. We exclude categories whose items often require underspecified causal, affective, or world\-knowledge completion rather than explicit actor\-attributed belief reconstruction\. In particular, Unexpected Outcome contains many stories, but its items frequently hinge on explaining surprising outcomes from sparse context, making the relevant mental states less consistently recoverable as faithful belief propositions\. After holding out 21 calibration stories, the benchmark contains 895 stories and 22,343 labeled belief propositions\. Table [4](https://arxiv.org/html/2605.26322#A2.T4) reports the retained category counts, belief counts, and belief\-order distribution\.
Table 4: OmniToM benchmark category statistics\.
### B\.2 Human\-Calibrated Annotation Pipeline
Benchmark construction must balance semantic rigor with scalable annotation\. Exhaustively extracting multi\-order belief structures and labeling every belief under the schema introduces substantial cognitive load for human annotators at scale\. We therefore adopt a human\-calibrated pipeline that uses human\-annotated gold structures to guide validation and calibration, while separating annotation generation from semantic evaluation to reduce self\-enhancement bias\. Prompt design and the prompt\-level calibration sweep are reported separately in App\. [C](https://arxiv.org/html/2605.26322#A3)\.
To support validation and calibration, we reserve a 21\-story subset consisting of three stories from each retained category\. For Stage 1 extraction, two domain experts independently constructed belief propositions for each story and then reconciled disagreements to produce a human consensus structure\. For Stage 2 labeling, we use Mistral\-Small 24B \[Mistral AI, [2025](https://arxiv.org/html/2605.26322#bib.bib26)\] only to pre\-populate candidate labels under strict schema constraints, reducing annotator burden without delegating final decisions to the model\. Three annotators then verified these labels, resolving disagreements via majority vote or expert adjudication\.
#### Human annotation effort estimate\.
Benchmark development involved a multi\-month annotation effort by 11 project annotators, totaling approximately 1\.1K person\-hours of annotation\-related work\. This effort covered schema refinement, pilot annotation, belief extraction, schema labeling, semantic\-alignment annotation, calibration, adjudication, and validation before the final LLM\-assisted pipeline was fixed\. The estimate is based on assignment logs and includes annotation work directly supporting the benchmark; it excludes general research time such as paper writing, experiment execution, model analysis, and project meetings\. These hours reflect the human work required to define, calibrate, validate, and adjudicate the annotation process before scaling the finalized pipeline to the 895\-story benchmark split\.
### B\.3 Calibration Model Selection
With the semantic judge fixed, we evaluated candidate annotation models on the 21\-story calibration subset using the same Stage 1 $F_1$ and Stage 2 accuracy metrics used in the benchmark\. Claude\-Sonnet\-4\.5 achieved the strongest combined calibration performance, with 72\.88 Stage 1 $F_1$ and 93\.62 Stage 2 accuracy, and was therefore selected as the benchmark annotation model\. Table [5](https://arxiv.org/html/2605.26322#A2.T5) reports the full calibration comparison\.
Table 5: Calibration model\-selection results on the 21\-story subset \(%\)\. ST1: extraction $F_1$; ST2: belief\-labeling accuracy\.
#### Stage 2 label\-normalization note\.
For Stage 2 labeling evaluation, predicted labels are canonicalized only when they are valid variants of legal schema labels \(e\.g\., trimmed strings, prefixed forms, or known aliases\)\. Predictions that remain noisy or non\-label text after normalization are counted as incorrect\.
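The normalization rule can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function name `canonicalize`, the alias table, and the choice to match case-insensitively are not taken from the benchmark code, which may recognize a different set of variants.

```python
LEGAL_TRUTH_STATUS = ("True", "False", "Unknown")

# Hypothetical alias table; the aliases actually accepted by the benchmark
# are not listed in this appendix.
TRUTH_STATUS_ALIASES = {"t": "True", "f": "False", "unk": "Unknown"}


def canonicalize(raw, legal, aliases):
    """Map a raw prediction to a legal label, or None if it stays non-label text."""
    # Trimmed strings: drop surrounding whitespace and stray quote characters.
    text = raw.strip().strip("'\"`").strip()
    # Prefixed forms such as "Truth-Status: False": keep the part after the colon.
    if ":" in text:
        text = text.split(":", 1)[1].strip()
    key = text.casefold()
    for label in legal:
        if key == label.casefold():
            return label
    # Known aliases; anything else returns None and is counted as incorrect.
    return aliases.get(key)


for raw in ["  false ", "Truth-Status: Unknown", "unk", "probably true"]:
    print(repr(raw), "->", canonicalize(raw, LEGAL_TRUTH_STATUS, TRUTH_STATUS_ALIASES))
```

Returning `None` instead of guessing the nearest label mirrors the rule that residual noisy text is scored as an error rather than repaired.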
### B\.4 Human Annotation Reliability and Judge Calibration
The reliability analysis covers the human and judge\-calibration checks used to support benchmark construction and Stage 1 evaluation\. It quantifies agreement for Stage 1 belief extraction after expert reconciliation, reports Stage 2 label reliability across annotators, and assesses candidate semantic judges against human semantic\-alignment decisions\.
For Stage 1 extraction, two domain experts independently extracted belief structures for the calibration stories\. Because extraction is open\-ended, we report an overlap\-style reconciliation metric over the reviewed belief set\. Let $R$ be the set of reviewed Stage 1 belief propositions, and let $m_i = 1$ when belief $i$ required no reconciliation remark in the Belief Remarks field and $m_i = 0$ otherwise\. Stage 1 overlap is

$$\mathrm{Overlap}_{S1} = \frac{1}{|R|} \sum_{i \in R} m_i.$$

On the 21\-story calibration subset, 360 of 430 reviewed beliefs required no reconciliation remark, yielding 83\.72% overlap\. Reconciliation remarks record additions, deletions, and revisions used to produce the final consensus structure\.
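As a quick arithmetic check, the overlap figure can be reproduced from the two reported counts. This is a minimal sketch; the function name `stage1_overlap` is illustrative, and the flag list is synthesized from the counts rather than read from the annotation files.

```python
def stage1_overlap(no_remark_flags):
    """Fraction of reviewed beliefs with m_i = 1 (no reconciliation remark)."""
    if not no_remark_flags:
        raise ValueError("the reviewed belief set R is empty")
    return sum(no_remark_flags) / len(no_remark_flags)


# 360 of the 430 reviewed beliefs needed no remark; the other 70 did.
flags = [1] * 360 + [0] * 70
overlap = stage1_overlap(flags)
print(f"{100 * overlap:.2f}%")  # 83.72%
```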
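Table 6: Stage 2 human annotation reliability on the 21\-story calibration subset\.

| Dim\. | $N$ | Agree | Agr\. \(%\) |
| --- | --- | --- | --- |
| Order | 390 | 382 | 97\.95 |
| Truth Status | 390 | 338 | 86\.67 |
| Knowledge Access | 390 | 335 | 85\.90 |
| Representation | 390 | 380 | 97\.44 |
| Content Type | 390 | 346 | 88\.72 |
| Mental Source | 390 | 368 | 94\.36 |
| Context | 390 | 369 | 94\.62 |
| Overall | 2,730 | 2,518 | 92\.23 |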
For Stage 2 labeling, three annotators verified schema labels across the same calibration subset\. Agreement is computed as strict all\-annotator exact match over aligned beliefs and label cells\. Table [6](https://arxiv.org/html/2605.26322#A2.T6) reports the resulting per\-dimension reliability, with 2,518 of 2,730 label cells in full agreement overall \(92\.23%\)\. In Table [6](https://arxiv.org/html/2605.26322#A2.T6), $N$ denotes label cells, Agree denotes strict three\-annotator exact matches, and Agr\. denotes the agreement rate\. These gold annotations are then used as the merged source for benchmark construction\.
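Strict all-annotator agreement counts a label cell only when every annotator assigned the same label. The sketch below shows that rule on three hypothetical cells and then recomputes the Table 6 rates from the reported counts; the function and variable names are illustrative and not taken from the benchmark code.

```python
def strict_agreement(cells):
    """cells: one tuple of annotator labels per label cell; returns (agree, total)."""
    agree = sum(1 for labels in cells if len(set(labels)) == 1)
    return agree, len(cells)


# Hypothetical cells: the second one is a 2-vs-1 split, so it does not count.
toy_cells = [
    ("True", "True", "True"),
    ("Private", "Shared", "Private"),
    ("1", "1", "1"),
]
print(strict_agreement(toy_cells))  # (2, 3)

# Agreement counts reported in Table 6, with 390 label cells per dimension.
AGREE = {
    "Order": 382,
    "Truth Status": 338,
    "Knowledge Access": 335,
    "Representation": 380,
    "Content Type": 346,
    "Mental Source": 368,
    "Context": 369,
}
CELLS_PER_DIMENSION = 390

rates = {dim: round(100 * count / CELLS_PER_DIMENSION, 2) for dim, count in AGREE.items()}
total_agree = sum(AGREE.values())
total_cells = CELLS_PER_DIMENSION * len(AGREE)
overall = round(100 * total_agree / total_cells, 2)
print(total_agree, total_cells, overall)  # 2518 2730 92.23
```

Under this strict rule a two-against-one split scores the same as a three-way disagreement, which is why the rates sit below what a majority-vote measure would report.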
#### Semantic judge calibration\.
Stage 1 benchmark evaluation requires semantic alignment between predicted and gold belief propositions, where exact string matching is too strict\. We therefore compared candidate semantic judges against human alignment decisions under the MatchCount protocol\. The calibration set contains 126 human\-evaluated prediction tables, evenly sampled from three extraction models: Qwen3 32B, Gemma\-3 27B, and Mistral\-Large 123B\. Across gold and prediction tables, the merged annotation set contains 4,715 beliefs\. Human annotators reached 88\.86% strict agreement on jointly annotated semantic\-alignment decisions \(3,955/4,451\), providing the human comparison set used for judge calibration\.
For each story $s$, let $U^{(s)}$ denote the set of normalized \(Actor, Belief\) pairs on which both the human annotation and candidate judge decision are defined\. Story\-level human–judge agreement is

$$\mathrm{Agreement}^{(s)} = \frac{1}{|U^{(s)}|} \sum_{(a,b) \in U^{(s)}} \mathbf{1}\!\left[ y_H^{(s)}(a,b) = y_J^{(s)}(a,b) \right],$$

where $y_H^{(s)}(a,b) \in \{0,1\}$ is the human binary alignment decision, $y_J^{(s)}(a,b) \in \{0,1\}$ is the judge decision, and a positive decision corresponds to $\mathrm{MatchCount}(a,b) > 0$\. Candidate judges are compared by macro\-averaging $\mathrm{Agreement}^{(s)}$ across calibration samples\. Under the TELeR Level 4 judge prompt, GPT\-5 achieved the highest human–judge agreement among the evaluated candidates \(72\.03%\), followed by Gemini\-2\.5 Flash \(71\.10%\), Claude\-Sonnet\-4\.5 \(68\.57%\), Llama\-3\.3 70B \(64\.48%\), and DeepSeek\-R1\-Distill\-Qwen 32B \(63\.65%\)\. We therefore fix GPT\-5 as the human\-calibrated semantic judge for Stage 1 evaluation\. Its agreement is moderate rather than definitive, however, so Stage 1 $F_1$ should be interpreted as an approximate aggregate extraction metric\.
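The agreement computation can be sketched directly from these definitions. This is an illustrative implementation under stated assumptions: decisions are stored as dictionaries keyed by normalized \(Actor, Belief\) pairs, samples whose shared set is empty are skipped when macro-averaging (the appendix does not say how such cases are handled), and all names and toy beliefs are hypothetical.

```python
def to_decision(match_count):
    """Binarize a MatchCount: any positive count is a positive alignment decision."""
    return 1 if match_count > 0 else 0


def story_agreement(human, judge):
    """Agreement over U^(s): the pairs on which both decisions are defined."""
    shared = human.keys() & judge.keys()
    if not shared:
        return None  # assumption: undefined samples are skipped
    matches = sum(1 for key in shared if human[key] == judge[key])
    return matches / len(shared)


def macro_agreement(samples):
    """Unweighted mean of per-sample agreement across calibration samples."""
    scores = [story_agreement(human, judge) for human, judge in samples]
    scores = [score for score in scores if score is not None]
    if not scores:
        raise ValueError("no sample has a jointly defined decision")
    return sum(scores) / len(scores)


# Toy data: the judge gives no decision for the world-level row in sample A.
human_a = {
    ("Sally", "The marble is in the basket"): to_decision(1),
    ("Anne", "Sally will look in the basket"): to_decision(0),
    ("world", "Anne moved the marble"): to_decision(1),
}
judge_a = {
    ("Sally", "The marble is in the basket"): to_decision(1),
    ("Anne", "Sally will look in the basket"): to_decision(2),
}
human_b = {("world", "The shop is closed"): 1}
judge_b = {("world", "The shop is closed"): 1}

print(story_agreement(human_a, judge_a))  # 0.5
print(macro_agreement([(human_a, judge_a), (human_b, judge_b)]))  # 0.75
```

Macro-averaging gives each sample equal weight regardless of how many beliefs it contains, so a short story with one disputed pair moves the score as much as a long one.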
#### Human MatchCount distribution\.
We also inspected the human semantic\-alignment labels to verify that the one\-to\-many MatchCount range used by the judge prompt was sufficient for the calibration setting\. Across the 126 human\-evaluated prediction tables, the merged human alignment set contains 4,715 beliefs\. The MatchCount distribution is MC=0: 42\.46% \(2,002 beliefs\), MC=1: 56\.22% \(2,651\), MC=2: 1\.29% \(61\), and MC=3: 0\.02% \(1\)\. Human alignment produced no cases requiring a count above 3, supporting the conservative 2–3 range used for compound belief matches\.
### B\.5 Benchmark Record Format
The benchmark is organized as one JSON object per story, with fields for story metadata, belief propositions, and seven\-dimensional schema labels\. Public release of the dataset and accompanying code is forthcoming\. Table [7](https://arxiv.org/html/2605.26322#A2.T7) defines the planned record fields\.
Table 7: Benchmark record format planned for public release\.
### B\.6 Licensing and Release Notes
OmniToM reuses story text from ToMBench \[Chen et al\., [2024](https://arxiv.org/html/2605.26322#bib.bib8)\]\. The public ToMBench repository distributes its materials under the MIT License\. The forthcoming OmniToM dataset and code release is intended to preserve upstream attribution for the source stories and to distribute OmniToM annotations, prompt builders, replication code, dataset\-card documentation, and Croissant metadata with Responsible AI fields under the MIT License\.
Third\-party model APIs and hosted inference services used during construction or evaluation will not be redistributed as part of the dataset or code artifacts; users who extend the future public runner to such services must follow the corresponding provider terms\. Following ToMBench’s benchmark\-use caution, we recommend using OmniToM for evaluation rather than training to reduce benchmark contamination, although this recommendation is advisory and separate from the data/software license\.
## Appendix C Prompt Engineering via TELeR Taxonomy
This appendix records the TELeR prompt taxonomy used to specify OmniToM prompt families, followed by the core prompt protocol used in benchmark construction, zero\-shot evaluation, and semantic\-judge calibration\. The prompt bodies below print the L3 base prompts and shared L4 augmentation blocks, indicating where category\-conditioned instructions and few\-shot in\-context examples are inserted\. Full example bodies are omitted for space; worked annotation examples appear in App\. [E](https://arxiv.org/html/2605.26322#A5)\.
### C\.1 TELeR Prompt Taxonomy
TELeR defines a prompt for a complex task as the combination of a *directive* and the associated *data* supplied to the model \[Karmaker Santu and Feng, [2023](https://arxiv.org/html/2605.26322#bib.bib9)\]\. The taxonomy categorizes the directive along four dimensions: *Turn*, distinguishing single\-turn from multi\-turn prompting; *Expression*, distinguishing question\-style from instruction\-style directives; *Level of Details*, measuring how much task structure is provided; and *Role*, distinguishing prompts with a defined system role from prompts without one\.
The Level\-of\-Details axis ranges from Level 0, where no directive is given beyond the task data, to Level 6, where a complex directive includes a high\-level goal, explicit sub\-tasks, evaluation or few\-shot guidance, additional retrieved information, and an explicit request for output justification\. OmniToM uses system\-role defined, single\-turn, instruction\-style prompts\. The directives are task\-oriented rather than question\-based because benchmark construction and evaluation require structured tables rather than conversational answers\.
Figure 7: TELeR prompt taxonomy from Karmaker Santu and Feng \[[2023](https://arxiv.org/html/2605.26322#bib.bib9)\]\. TELeR categorizes prompts by Turn, Expression, Level of Details, and Role\. Reproduced unchanged under [CC BY 4\.0](https://creativecommons.org/licenses/by/4.0/)\.
For benchmark construction, we adopt TELeR Level 4 prompts\. Level 4 is the most appropriate point on the taxonomy for OmniToM because belief extraction, belief labeling, and semantic matching all require explicit goals, decomposed subtasks, output/evaluation criteria, and examples of the expected structure\. Level 5 would introduce retrieved external information, which is inappropriate because all evidence must come from the story and supplied belief tables\. Level 6 would solicit explanations or justifications, which conflicts with the table\-only output format needed for reliable parsing and evaluation\. For zero\-shot benchmark evaluation, we use Level 3 base prompts without few\-shot examples, so models are evaluated on task understanding from the directive and subtask decomposition alone\.
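The relationship between the two prompt levels used here can be sketched as simple string assembly: the Level 3 prompt is the base directive alone, and the Level 4 prompt appends the shared content, category-conditioned, and formatting blocks followed by few-shot examples. The function `build_prompt`, its parameters, and the placeholder strings are hypothetical; the released prompt builders may be organized differently.

```python
def build_prompt(level, base_prompt, l4_blocks=(), few_shot_examples=()):
    """Assemble a TELeR Level 3 or Level 4 prompt from its parts."""
    if level == 3:
        # Zero-shot evaluation: directive and subtask decomposition only.
        return base_prompt
    if level == 4:
        # Construction and calibration: add criteria blocks, then examples.
        parts = [base_prompt, *l4_blocks, *few_shot_examples]
        return "\n\n".join(parts)
    raise ValueError(f"only TELeR Levels 3 and 4 are used here, got {level}")


# Placeholder strings standing in for the prompt bodies printed in this appendix.
base = "You are a Theory of Mind expert whose task is to ..."
blocks = ["A good output should satisfy the following: ...",
          "A correctly formatted output should satisfy the following: ..."]
examples = ["Example 1: ..."]

zero_shot = build_prompt(3, base, blocks, examples)
construction = build_prompt(4, base, blocks, examples)
print(len(zero_shot) < len(construction))  # True
```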
### C\.2 Stage 1 Belief\-Extraction Prompt
#### Usage in the benchmark\.
Benchmark construction and calibration use a TELeR Level 4 extraction prompt with a category\-conditioned evaluation block and category\-matched few\-shot examples\. Zero\-shot benchmark evaluation uses the Level 3 base prompt without few\-shot examples\.
Table 8: Category\-conditioned Stage 1 extraction emphases used in Level 4 construction prompts\.

**Stage 1 Extraction: Level 3 Zero\-Shot Base Prompt**

You are a Theory of Mind expert whose task is to extract multi\-order actor beliefs from the narrative and output a table with columns Actor, Belief, and Order by performing the following steps\. A belief is a minimal proposition expressing what an actor takes to be true\.

1. Identify narrated events and states that the story presents as facts, and record them as world\-level beliefs attributed to the special actor 'world' \(order 0\)\.
2. Identify all actors, including characters or groups, who appear in the narrative and are capable of holding beliefs\.
3. For each actor, extract beliefs about the narrated events or states of the world, and record them as first\-order beliefs \(order 1\)\.
4. For each actor, extract beliefs about other actors’ beliefs, applying this notion recursively for nested beliefs, and record them as higher\-order beliefs \(order 2 or higher\)\.

**Stage 1 Extraction: Shared Level 4 Content Block**

A good output should satisfy the following:

- Include only beliefs grounded in the narrative; do not invent actors, events, or beliefs\.
- Express each belief as a single, atomic, declarative proposition; split compound statements into separate rows\.
- Use third\-person language only and resolve pronouns to exact actor names as they appear in the narrative\.

**Stage 1 Extraction: Category\-Conditioned Level 4 Insertion**

A category\-conditioned instruction block is inserted here to name the story category and highlight the relevant extraction emphasis\.

Categories: Ambiguous Story Task; False Belief Task; Faux\-pas Recognition Test; Hinting Task Test; Persuasion Story Task; Scalar Implicature Test; Strange Story Task\.

**Stage 1 Extraction: Level 4 Formatting Block**

A correctly formatted output should satisfy the following:

- Present the result as a pipe\-separated table using the '\|' character\.
- The first row must be exactly: Actor \| Belief \| Order
- Order rows with world\-level beliefs first when present, then beliefs grouped by actor; within each actor group, order beliefs chronologically by the story’s event sequence\.

Few\-shot: Category\-matched in\-context examples are then provided to illustrate the expected extraction format; see App\. [E](https://arxiv.org/html/2605.26322#A5) for worked examples\.
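Because the prompt fixes a pipe-separated layout with an exact header row, model output can be checked and parsed with a few lines of code. The sketch below is an assumption about how such parsing might look, not the benchmark's parser: `parse_stage1_table` is a hypothetical name, the example table is invented for illustration, and the sketch assumes that belief text never contains a literal `|`.

```python
EXPECTED_HEADER = ["Actor", "Belief", "Order"]


def parse_stage1_table(text):
    """Parse a pipe-separated Actor | Belief | Order table into row dicts."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty output")
    header = [cell.strip() for cell in lines[0].split("|")]
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header row: {lines[0]!r}")
    rows = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        if len(cells) != 3:
            raise ValueError(f"expected 3 cells, got {len(cells)}: {line!r}")
        actor, belief, order = cells
        rows.append({"actor": actor, "belief": belief, "order": int(order)})
    return rows


# Invented example in the requested format (world-level rows first).
example_output = """
Actor | Belief | Order
world | Anne moved the marble to the box | 0
Sally | The marble is in the basket | 1
Anne | Sally believes the marble is in the basket | 2
"""

rows = parse_stage1_table(example_output)
print([(row["actor"], row["order"]) for row in rows])
# [('world', 0), ('Sally', 1), ('Anne', 2)]
```

Raising on malformed rows, rather than silently dropping them, keeps unusable outputs visible so they can be counted separately.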
### C\.3 Stage 2 Belief\-Labeling Prompt
#### Usage in the benchmark\.
Benchmark construction and calibration use a TELeR Level 4 belief\-labeling prompt with category\-matched labeled few\-shot tables\. Zero\-shot benchmark evaluation uses the Level 3 base prompt without few\-shot examples\.
**Stage 2 Belief Labeling: Level 3 Zero\-Shot Base Prompt**

You are a Theory of Mind expert whose task is to label a table of actor beliefs, given a narrative, by assigning a label from each of the following closed sets: Order \(0/1/2/3\), Truth\-Status \(True/False/Unknown\), Knowledge\-Access \(Private/Shared/Public\), Representation \(Explicit/Implicit\), Content Type \(Location, Contents/Physical State, Identity/Relation, Epistemic, Desire/Intention, Emotion, Trait/Value, Action/Event\), Mental\-Source \(Narration, Perception, Memory, Testimony, Inference, Imagination, Unknown\), and Context \(Deceptive, Temporal, Counterfactual, Neutral\); and outputting only a table with columns Actor and Belief, followed by one column for each labeling set\.

In this context, a belief is a minimal proposition expressing what an actor takes to be true about the world or about another actor’s mental state\. Label each belief in the provided table by assigning values for the following dimensions, using the narrative as evidence:

1. Determine the Order of the belief, which captures the depth of belief reasoning:
   - Order 0: Narrator\- or world\-level facts that anchor the story’s ground truth and are not held by any actor\.
   - Order 1: First\-order beliefs \(A believes p\)\.
   - Order 2: Second\-order beliefs \(A believes B believes p\)\.
   - Order 3: Higher\-order recursive beliefs \(A believes B believes C believes p\)\.
2. Determine the Truth\-Status of the belief relative to the narrative:
   - True if the belief is verified or entailed by the narration\.
   - False if the belief is contradicted by the narration\.
   - Unknown if the narrative does not provide sufficient evidence\.
3. Determine the Knowledge\-Access of the belief by assessing who could realistically know it in the story world:
   - Private if the belief is held internally without evidence others know it\.
   - Shared if it is mutually known within a subgroup through explicit acknowledgment or obvious mutual awareness\.
   - Public if it is common ground across all actors \(announced, jointly witnessed, or mutually known to be mutually known\)\.
4. Determine the Representation of the belief:
   - Explicit if the belief is directly stated, spoken, or narrated as a mental state\.
   - Implicit if the belief must be inferred from actions, perception, or context\.
5. Determine the Content Type by identifying what the proposition is about:
   - Use Action/Event for happenings; Desire/Intention for plans or goals\.
   - Use Location when the proposition concerns where an entity is or was, even if it involves a container
   - Use Contents / Physical State only when the belief concerns what a container holds or an object’s condition\.
   - Use Epistemic for beliefs about beliefs, knowledge, attention, or awareness\.
6. Determine the Mental\-Source of the belief, indicating how it was acquired:
   - Narration \(Order 0 only\), Perception, Memory, Testimony, Inference, Imagination, or Unknown\.
7. Determine the Context of the belief:
   - Deceptive if shaped by lying, omission, or misdirection\.
   - Temporal if the belief is outdated or reflects recall of a prior true state\.
   - Temporal \+ False indicates an outdated false belief\.
   - Temporal \+ True indicates accurate recall of a past fact\.
   - Counterfactual if the belief occurs in a hypothetical or pretense frame\.
   - Neutral if none apply\.
**Stage 2 Belief Labeling: Level 4 Addition Block**

A good output should satisfy the following:

- Assign labels only using evidence grounded in the narrative; do not introduce new interpretations, entities, beliefs, or label values\.
- Assign exactly one valid label from each closed set to every belief line\.
- Use only the predefined label sets for Order, Truth\-Status, Knowledge\-Access, Representation, Content Type, Mental\-Source, and Context\.
- If evidence is insufficient or ambiguous, select Unknown rather than speculate\.
- Do not include explanations, notes, justifications, or commentary beyond the required labels\.

A correctly formatted output should satisfy the following:

- Present the result as a pipe\-separated table using the '\|' character\.
- Each subsequent row must correspond to one belief from the input belief table\.
- Preserve the original order of beliefs from the input table\.
- Do not include free\-text explanations, narrative descriptions, or additional columns\.

Few\-shot: Category\-matched in\-context examples are then provided to illustrate the expected label choices and table format; see App\. [E](https://arxiv.org/html/2605.26322#A5) for worked examples\.
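Since the prompt requires one output row per input belief in the original order, predicted and gold rows can be compared position by position. The sketch below computes per-dimension exact-match accuracy under that alignment assumption. It illustrates the idea only: the benchmark's own scoring code is not reproduced in this appendix, and the function name and toy rows are hypothetical.

```python
DIMENSIONS = [
    "Order", "Truth-Status", "Knowledge-Access", "Representation",
    "Content Type", "Mental-Source", "Context",
]


def per_dimension_accuracy(gold_rows, predicted_rows):
    """Exact-match accuracy per dimension, assuming rows are aligned by position."""
    if not gold_rows or len(gold_rows) != len(predicted_rows):
        raise ValueError("gold and predicted tables must be non-empty and equally long")
    total = len(gold_rows)
    return {
        dim: sum(1 for gold, pred in zip(gold_rows, predicted_rows)
                 if gold[dim] == pred.get(dim)) / total
        for dim in DIMENSIONS
    }


# Toy gold labels for two beliefs (invented for illustration).
gold = [
    {"Order": "0", "Truth-Status": "True", "Knowledge-Access": "Public",
     "Representation": "Explicit", "Content Type": "Action/Event",
     "Mental-Source": "Narration", "Context": "Neutral"},
    {"Order": "1", "Truth-Status": "False", "Knowledge-Access": "Private",
     "Representation": "Implicit", "Content Type": "Location",
     "Mental-Source": "Memory", "Context": "Temporal"},
]
# The prediction differs from gold only in Knowledge-Access for the second belief.
predicted = [dict(gold[0]), {**gold[1], "Knowledge-Access": "Shared"}]

accuracy = per_dimension_accuracy(gold, predicted)
print(accuracy["Knowledge-Access"], accuracy["Order"])  # 0.5 1.0
```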
### C\.4 Semantic Judge Prompt
#### Usage in the benchmark\.
The semantic judge is not used to generate benchmark annotations\. It is used for Stage 1 semantic alignment during calibration and evaluation\. For zero\-shot evaluation, the selected GPT\-5 judge uses the TELeR Level 4 family with appended few\-shot alignment examples\. Candidate judge models are compared on the 21\-story calibration subset across TELeR Levels 1–4, and the final evaluation protocol fixes the Level 4 prompt family reproduced below\.
**Semantic Judge: Level 3 Zero\-Shot Base Prompt**

You are a Theory of Mind evaluation expert whose task is to semantically match rows between two belief tables \(Prediction, Ground Truth\) extracted from the same short Story Narrative and output only the two tables, explicitly labeled “Prediction Table” and “Ground Truth Table,” with an added MatchCount column indicating how many distinct semantically equivalent rows exist in the other table for the same Actor\. In this context, a Belief is a minimal statement of what an actor takes to be true about the world \(facts/events\) or about other actors’ mental states, expressed in natural language\. Perform the task by following these steps:

1) If a Story Narrative is provided, use it only to resolve ambiguity \(pronouns, aliases, implicit entities\) and paraphrase meaning; if no narrative is provided, ignore narrative context entirely\. In all cases, do not add rows and do not introduce new beliefs that are not present in either table\.
2) Treat Actors as distinct mental agents and normalize only cosmetic variants of the same Actor name \(case/spacing/punctuation and clear shortenings\); never merge different Ground Truth Actors\.
3) Handle the special actor 'world' first: treat 'world' as the key for narrated facts and events, and align world\-level beliefs conservatively, typically one\-to\-one, allowing only minor normalization differences\.
4) Restrict candidate matches to the same Actor group after normalization; if the Actor does not match, the row cannot match regardless of belief similarity\.
5) Default to one\-to\-one with bookkeeping: if \(and only if\) there exists a clear semantically equivalent belief for the same Actor, assign the row its single best match among currently\-unmatched target rows; otherwise assign no match \(MatchCount = 0\)\. If multiple rows compete for the same target row, keep only the closest semantic match and force the others to choose different unmatched targets or become 0\.
6) Allow one\-to\-many only for compound rows: if a row clearly contains multiple independent beliefs, you may align it to 2–3 different rows in the other table within the same Actor group, but only if each aligned target row captures a distinct part of the compound meaning\.
7) Ensure symmetry: after completing matches for Prediction rows, also compute MatchCount for every Ground Truth row using the same alignment decisions\.
**Semantic Judge: Level 4 Addition Block**

A good output should satisfy the following:

- Only compare beliefs inside the same Actor group; Ground Truth Actors are unique and must not be merged; if Actor differs, it is NOT a match even if belief text is identical\.
- The special actor 'world' represents narrated facts and events \(not a character in the story\); world\-level beliefs should generally align one\-to\-one across tables with only minor normalization differences\.
- Match only beliefs that are semantically equivalent; do not invent or force alignments and do not introduce beliefs that are not present in either table\.
- Use conservative alignment: when multiple rows are 'close', prefer the best single match and leave other rows unmatched rather than double\-counting\.
- Prefer one\-to\-one matches; allow one\-to\-many \(MatchCount 2–3\) only for genuinely compound beliefs\.
- Do not match across different Actors, even if text looks similar\.
- Output exactly two tables labeled Prediction and Ground Truth \(in that order\), with CSV header Actor,Belief,MatchCount and no extra text\.

A correctly formatted output should satisfy the following:

- Output exactly two tables labeled Prediction and Ground Truth \(in that order\)\.
- Each table must be comma\-separated with header: Actor,Belief,MatchCount
- Preserve original row order and add exactly one MatchCount column\.
- Do not include explanations or any text outside the tables\.

Few\-shot: In\-context alignment examples are then provided to illustrate the matching process\.
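The judge's two CSV tables can be turned into extraction scores once each MatchCount is binarized. The sketch below makes an explicit assumption that is not confirmed by this appendix: precision is taken as the share of Prediction rows with MatchCount > 0 and recall as the share of Ground Truth rows with MatchCount > 0. The exact Stage 1 $F_1$ definition is given in the main text and may differ; the function names and toy tables are hypothetical.

```python
import csv
import io


def match_counts(csv_table):
    """Read the MatchCount column from one judge table (header included)."""
    reader = csv.DictReader(io.StringIO(csv_table.strip()))
    return [int(row["MatchCount"]) for row in reader]


def precision_recall_f1(prediction_counts, gold_counts):
    """ASSUMED scoring: a row counts as matched when its MatchCount is positive."""
    def matched_share(counts):
        return sum(1 for count in counts if count > 0) / len(counts) if counts else 0.0

    precision = matched_share(prediction_counts)
    recall = matched_share(gold_counts)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented judge output for one story, split into its two tables.
prediction_table = """
Actor,Belief,MatchCount
world,Anne moved the marble to the box,1
Sally,The marble is in the basket,1
Anne,Sally is angry,0
"""
ground_truth_table = """
Actor,Belief,MatchCount
world,Anne moved the marble to the box,1
Sally,The marble is in the basket,1
Sally,Anne did not touch the marble,0
Anne,Sally believes the marble is in the basket,0
"""

p, r, f1 = precision_recall_f1(match_counts(prediction_table),
                               match_counts(ground_truth_table))
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.6667 0.5 0.5714
```

Using the `csv` module rather than splitting on commas keeps quoted belief text that contains commas intact.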
### C\.5 Prompt\-Level Calibration Results
Table [9](https://arxiv.org/html/2605.26322#A3.T9) reports the prompt\-selection sweep that motivates the final Level 4 prompt family used for benchmark construction and semantic\-judge calibration\. The sweep compares TELeR Levels 1–4 on the 21\-story calibration subset and shows consistent gains from the more explicit Level 4 format across extraction, labeling, and judge agreement\.
Table 9: Prompt\-engineering results across TELeR levels on the 21\-story calibration subset \(%\)\. Level 4 uses in\-context examples\.
## Appendix D Extended Experiments and Audits
This appendix collects the experimental diagnostics and output audits used to interpret the aggregate results in Sec\. [6](https://arxiv.org/html/2605.26322#S6)\. It includes a Stage 1 extraction\-volume audit, the full Stage 2 category\-by\-dimension breakdown, the Stage 1 MatchCount distribution audit, and unusable\-output statistics\.
### D\.1 Stage 1 Extraction Volume Audit
Figure [8](https://arxiv.org/html/2605.26322#A4.F8) audits whether Stage 1 performance is primarily explained by output volume\. For each model, the figure compares the average number of generated belief propositions per story with extraction Precision, Recall, and $F_1$\. This view helps separate two effects: models that generate too few beliefs tend to under\-recover the gold structure, while higher\-volume generation can improve recall but may introduce redundant or unsupported beliefs\. The pattern indicates that output volume affects the precision\-recall trade\-off, but it does not by itself explain failures in actor\-specific belief recovery\.
Figure 8: Stage 1 extraction\-volume audit\. Average generated beliefs per story are plotted against Precision, Recall, and $F_1$\. The figure shows that output volume affects the precision\-recall trade\-off, but volume alone does not explain failures in actor\-specific belief recovery\.
### D\.2 Extended Stage 2 Belief\-Labeling Results
Table 10 expands the aggregate Stage 2 results by reporting per\-dimension accuracy within each retained story category\. This table supports per\-dimension analysis beyond Table [3](https://arxiv.org/html/2605.26322#S6.T3), showing whether a model’s aggregate labeling accuracy is driven by particular schema dimensions or story types\.
Table 10: Stage 2 belief\-labeling accuracy \(%\)\. Consolidated view of all seven schema dimensions across the seven story categories\. Category abbreviations: AST \(Ambiguous Story Task\), FBT \(False Belief Task\), FPT \(Faux\-pas Recognition Test\), HT \(Hinting Task Test\), PST \(Persuasion Story Task\), SIT \(Scalar Implicature Test\), SST \(Strange Story Task\)\.

| Model | AST | FBT | FPT | HT | PST | SIT | SST |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Order | | | | | | | |
| Llama\-3\.1 8B | 64\.00 | 59\.69 | 77\.41 | 70\.20 | 75\.94 | 81\.38 | 69\.44 |
| Llama\-3\.3 70B | 88\.20 | 99\.79 | 91\.70 | 89\.44 | 89\.40 | 96\.95 | 92\.31 |
| Mistral\-Large 123B | 95\.54 | 100\.00 | 95\.72 | 95\.96 | 96\.90 | 100\.00 | 96\.58 |
| Mistral\-Small 24B | 93\.34 | 100\.00 | 92\.34 | 92\.48 | 93\.56 | 99\.46 | 94\.38 |
| Qwen3 8B | 57\.14 | 70\.54 | 71\.01 | 80\.43 | 82\.08 | 73\.20 | 76\.67 |
| Gemini\-2\.5 Flash | 92\.05 | 99\.77 | 93\.83 | 90\.37 | 94\.94 | 99\.32 | 96\.46 |
| Gemma\-3 27B | 95\.04 | 99\.91 | 95\.50 | 95\.23 | 98\.14 | 99\.31 | 94\.28 |
| GPT\-5 | 95\.72 | 99\.90 | 93\.70 | 90\.67 | 95\.71 | 97\.65 | 93\.84 |
| Qwen3 32B | 95\.88 | 100\.00 | 95\.26 | 94\.52 | 95\.66 | 99\.91 | 94\.46 |
| Truth Status | | | | | | | |
| Llama\-3\.1 8B | 58\.82 | 53\.95 | 72\.83 | 56\.94 | 67\.42 | 72\.71 | 67\.32 |
| Llama\-3\.3 70B | 83\.59 | 88\.91 | 87\.25 | 79\.07 | 86\.80 | 71\.38 | 88\.16 |
| Mistral\-Large 123B | 86\.92 | 84\.76 | 87\.48 | 85\.91 | 85\.80 | 84\.48 | 88\.68 |
| Mistral\-Small 24B | 79\.25 | 91\.09 | 85\.12 | 68\.65 | 84\.26 | 76\.45 | 87\.39 |
| Qwen3 8B | 55\.09 | 67\.94 | 65\.03 | 68\.36 | 71\.89 | 64\.71 | 73\.04 |
| Gemini\-2\.5 Flash | 78\.16 | 91\.73 | 88\.34 | 71\.62 | 88\.81 | 79\.35 | 91\.56 |
| Gemma\-3 27B | 86\.45 | 86\.49 | 86\.33 | 83\.74 | 86\.00 | 66\.32 | 85\.68 |
| GPT\-5 | 89\.30 | 89\.81 | 81\.41 | 84\.11 | 81\.98 | 75\.25 | 82\.39 |
| Qwen3 32B | 86\.00 | 84\.79 | 86\.15 | 84\.19 | 83\.00 | 71\.42 | 84\.16 |
| Knowledge Access | | | | | | | |
| Llama\-3\.1 8B | 49\.43 | 42\.96 | 51\.65 | 57\.19 | 70\.51 | 64\.99 | 54\.68 |
| Llama\-3\.3 70B | 65\.66 | 54\.10 | 53\.03 | 74\.48 | 92\.53 | 63\.57 | 72\.02 |
| Mistral\-Large 123B | 71\.78 | 56\.49 | 54\.08 | 79\.97 | 89\.41 | 76\.87 | 85\.27 |
| Mistral\-Small 24B | 73\.43 | 55\.76 | 54\.19 | 79\.97 | 87\.99 | 77\.96 | 86\.57 |
| Qwen3 8B | 45\.39 | 39\.47 | 43\.82 | 72\.57 | 77\.25 | 53\.90 | 69\.14 |
| Gemini\-2\.5 Flash | 66\.32 | 63\.09 | 64\.13 | 78\.36 | 90\.71 | 68\.56 | 72\.12 |
| Gemma\-3 27B | 68\.37 | 54\.72 | 54\.35 | 79\.06 | 91\.67 | 76\.31 | 76\.25 |
| GPT\-5 | 72\.94 | 58\.08 | 59\.00 | 73\.80 | 86\.12 | 55\.02 | 69\.88 |
| Qwen3 32B | 71\.75 | 57\.34 | 54\.52 | 79\.92 | 87\.69 | 77\.05 | 84\.30 |
| Representation | | | | | | | |
| Llama\-3\.1 8B | 58\.02 | 54\.74 | 56\.42 | 68\.22 | 74\.62 | 70\.11 | 66\.54 |
| Llama\-3\.3 70B | 61\.14 | 58\.94 | 79\.80 | 76\.84 | 83\.42 | 70\.67 | 73\.05 |
| Mistral\-Large 123B | 68\.39 | 80\.30 | 68\.78 | 78\.88 | 69\.14 | 80\.74 | 67\.30 |
| Mistral\-Small 24B | 44\.27 | 85\.76 | 75\.85 | 68\.09 | 56\.83 | 59\.65 | 54\.39 |
| Qwen3 8B | 48\.29 | 51\.08 | 56\.64 | 76\.31 | 72\.24 | 64\.09 | 71\.67 |
| Gemini\-2\.5 Flash | 90\.05 | 96\.34 | 79\.42 | 92\.94 | 86\.48 | 83\.12 | 89\.14 |
| Gemma\-3 27B | 41\.90 | 48\.33 | 45\.78 | 56\.65 | 74\.40 | 59\.92 | 54\.20 |
| GPT\-5 | 77\.68 | 88\.88 | 73\.32 | 90\.97 | 86\.71 | 79\.50 | 88\.22 |
| Qwen3 32B | 59\.03 | 51\.37 | 75\.51 | 71\.34 | 55\.77 | 57\.22 | 63\.04 |
| Content Type | | | | | | | |
| Llama\-3\.1 8B | 42\.69 | 57\.42 | 43\.03 | 50\.08 | 45\.31 | 50\.30 | 50\.78 |
| Llama\-3\.3 70B | 74\.35 | 87\.51 | 59\.51 | 74\.52 | 66\.37 | 75\.39 | 72\.59 |
| Mistral\-Large 123B | 83\.39 | 86\.71 | 76\.32 | 84\.46 | 76\.35 | 89\.61 | 82\.44 |
| Mistral\-Small 24B | 73\.99 | 78\.96 | 68\.19 | 77\.66 | 72\.99 | 86\.03 | 74\.09 |
| Qwen3 8B | 47\.18 | 58\.46 | 45\.29 | 59\.50 | 52\.64 | 42\.15 | 56\.77 |
| Gemini\-2\.5 Flash | 82\.84 | 90\.10 | 85\.41 | 85\.65 | 75\.89 | 93\.00 | 85\.55 |
| Gemma\-3 27B | 73\.25 | 81\.17 | 71\.60 | 77\.43 | 65\.31 | 67\.42 | 77\.80 |
| GPT\-5 | 80\.99 | 86\.53 | 84\.40 | 78\.73 | 73\.04 | 77\.64 | 78\.90 |
| Qwen3 32B | 75\.77 | 91\.76 | 64\.84 | 75\.14 | 68\.84 | 61\.41 | 70\.56 |
| Mental Source | | | | | | | |
| Llama\-3\.1 8B | 53\.82 | 51\.15 | 48\.74 | 62\.83 | 65\.27 | 47\.13 | 59\.70 |
| Llama\-3\.3 70B | 79\.16 | 87\.10 | 77\.05 | 84\.10 | 76\.96 | 63\.27 | 76\.78 |
| Mistral\-Large 123B | 91\.91 | 91\.00 | 78\.95 | 91\.98 | 88\.66 | 81\.06 | 86\.62 |
| Mistral\-Small 24B | 89\.50 | 89\.99 | 75\.90 | 90\.15 | 84\.80 | 82\.06 | 85\.78 |
| Qwen3 8B | 52\.74 | 63\.76 | 48\.01 | 74\.72 | 74\.25 | 53\.78 | 67\.19 |
| Gemini\-2\.5 Flash | 86\.93 | 86\.69 | 75\.22 | 89\.38 | 84\.74 | 81\.04 | 87\.06 |
| Gemma\-3 27B | 83\.70 | 88\.66 | 66\.28 | 82\.16 | 76\.38 | 78\.16 | 80\.07 |
| GPT\-5 | 89\.68 | 86\.16 | 80\.65 | 87\.96 | 83\.18 | 74\.02 | 84\.24 |
| Qwen3 32B | 85\.00 | 84\.75 | 62\.87 | 83\.47 | 69\.21 | 75\.65 | 80\.12 |
| Context | | | | | | | |
| Llama\-3\.1 8B | 72\.09 | 59\.47 | 86\.56 | 80\.11 | 87\.12 | 82\.15 | 70\.07 |
| Llama\-3\.3 70B | 94\.25 | 84\.09 | 98\.54 | 94\.42 | 98\.67 | 90\.65 | 85\.54 |
| Mistral\-Large 123B | 96\.53 | 88\.71 | 98\.65 | 94\.16 | 98\.74 | 90\.72 | 87\.78 |
| Mistral\-Small 24B | 96\.10 | 84\.21 | 98\.39 | 94\.57 | 98\.68 | 90\.73 | 85\.46 |
| Qwen3 8B | 62\.19 | 60\.92 | 76\.54 | 85\.07 | 88\.78 | 75\.36 | 73\.38 |
| Gemini\-2\.5 Flash | 94\.65 | 86\.70 | 97\.15 | 95\.27 | 97\.54 | 85\.79 | 90\.73 |
| Gemma\-3 27B | 94\.86 | 88\.45 | 98\.36 | 94\.26 | 98\.81 | 90\.09 | 85\.40 |
| GPT\-5 | 95\.01 | 90\.00 | 91\.31 | 91\.50 | 95\.66 | 83\.02 | 83\.48 |
| Qwen3 32B | 95\.59 | 84\.70 | 95\.32 | 93\.23 | 98\.66 | 89\.59 | 84\.38 |
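As an illustration of how per\-dimension, per\-category accuracy of this kind can be computed, the following is a minimal sketch\. It assumes each labeled belief is available as a record holding a story category plus gold and predicted labels for the seven dimensions\. The record layout, the dimension keys, and the function name `per_dimension_accuracy` are hypothetical, and the toy records are invented for the example rather than taken from the benchmark\.

```python
from collections import defaultdict

# Hypothetical keys for the seven schema dimensions (names are assumptions).
DIMENSIONS = [
    "order", "truth_status", "knowledge_access", "representation",
    "content_type", "mental_source", "context",
]

def per_dimension_accuracy(records):
    """Return {(category, dimension): accuracy in percent}.

    Each record is a dict with keys 'category', 'gold', and 'pred', where
    'gold' and 'pred' map dimension names to labels. A missing predicted
    label counts as incorrect.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        for dim in DIMENSIONS:
            key = (rec["category"], dim)
            totals[key] += 1
            if rec["pred"].get(dim) == rec["gold"][dim]:
                hits[key] += 1
    return {key: 100.0 * hits[key] / totals[key] for key in totals}

# Toy data (invented): two FBT beliefs, one with a wrong truth-status label.
gold = {
    "order": 1, "truth_status": "False", "knowledge_access": "Private",
    "representation": "Implicit", "content_type": "Location",
    "mental_source": "Perception", "context": "Temporal",
}
wrong_truth = dict(gold, truth_status="True")
records = [
    {"category": "FBT", "gold": gold, "pred": dict(gold)},
    {"category": "FBT", "gold": gold, "pred": wrong_truth},
]
accuracy = per_dimension_accuracy(records)
print(accuracy[("FBT", "order")])         # 100.0
print(accuracy[("FBT", "truth_status")])  # 50.0
```

Keying the result by \(category, dimension\) keeps each cell of such a table independent, so a weak dimension in one story type is not averaged away by strong ones elsewhere\.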
### D\.3 Stage 1 MatchCount Distribution
Table [11](https://arxiv.org/html/2605.26322#A4.T11) reports the raw predicted\-side MatchCount distribution from the full Stage 1 benchmark runs\. The distribution is dominated by unmatched and one\-to\-one alignments, while compound alignments are rare\. Stage 1 precision and recall use only whether MatchCount is nonzero, so larger raw counts do not receive additional weight\. Counts above 3 are therefore reported as an output audit of the semantic judge rather than as a distinct scoring condition\.
Table 11: Predicted\-side MatchCount distribution in the full Stage 1 benchmark runs\. Counts are computed from model prediction tables evaluated by the fixed GPT\-5 semantic judge\.
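The binarization described above can be sketched as follows\. This is a minimal illustration, assuming that precision is the fraction of predicted beliefs with a nonzero MatchCount and that recall is the analogous fraction over gold\-side counts\. The gold\-side counts, the helper name `matched_fraction`, and the toy values are assumptions made for the example\.

```python
def matched_fraction(match_counts):
    """Fraction of items whose MatchCount is nonzero.

    Only matched vs. unmatched matters: a count of 4 contributes
    exactly as much as a count of 1.
    """
    if not match_counts:
        return 0.0
    return sum(1 for count in match_counts if count > 0) / len(match_counts)

# Toy counts (invented). Predicted side: one MatchCount per predicted belief.
predicted_side = [0, 1, 1, 4, 2]
# Gold side (assumed to exist symmetrically): one count per gold belief.
gold_side = [1, 1, 0, 1]

precision = matched_fraction(predicted_side)
recall = matched_fraction(gold_side)
print(precision)  # 0.8
print(recall)     # 0.75
```

Replacing the count of 4 with a count of 1 leaves the precision unchanged, which is the sense in which larger raw counts carry no extra weight\.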
### D\.4 Unusable\-Output Audit
Table 12: Unusable\-output audit across benchmark runs \(%\)\.

The largest coverage loss occurs for Qwen3 8B, with 62 unusable Stage 1 samples and 129 unusable Stage 2 samples out of the 895\-story benchmark\. Coverage failures are concentrated in smaller open\-weight models, suggesting difficulty sustaining the required structured output format across benchmark\-scale prompting\. Because we do not instrument generation traces, we treat these cases as format and coverage failures rather than evidence for a specific mechanism such as context pressure or completion\-budget exhaustion\. Closed\-source failures are sparse and non\-systematic in this audit\. In contrast, Llama\-3\.3 70B, Mistral\-Large 123B, Mistral\-Small 24B, and Gemma\-3 27B achieve complete Stage 2 coverage, indicating that the benchmark protocol is stable once models can reliably sustain the required output format\.
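For reference, the counts quoted for Qwen3 8B can be converted to percentages with a short calculation\. This sketch assumes the audit percentages are taken over the 895 stories, one sample per story per stage\. The helper name `unusable_rate` is hypothetical\.

```python
TOTAL_STORIES = 895  # benchmark size stated in the text

def unusable_rate(unusable, total=TOTAL_STORIES):
    """Share of samples with unusable output, in percent."""
    return 100.0 * unusable / total

stage1_rate = round(unusable_rate(62), 2)   # Qwen3 8B, Stage 1
stage2_rate = round(unusable_rate(129), 2)  # Qwen3 8B, Stage 2
print(stage1_rate)  # 6.93
print(stage2_rate)  # 14.41
```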
## Appendix E Worked Annotation Examples by Category
This appendix provides one worked annotation example for each retained benchmark category, using the final Level 4 demonstration format from benchmark construction\. Each table preserves the full story text together with the gold OmniToM beliefs and schema labels, grouped by actor\. Compact display labels are used for space: Identity abbreviates Identity/Relation, Physical abbreviates Contents/Physical State, Desire abbreviates Desire/Intention, Trait abbreviates Trait/Value, and Action abbreviates Action/Event\. App\. [A](https://arxiv.org/html/2605.26322#A1) lists the full closed\-set labels\.
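One way to hold such annotations in code is a small record type with one field per schema dimension\. The sketch below is an assumption about a convenient representation, not the benchmark's released format: the class name `BeliefProposition`, its field names, and the helper `false_beliefs` are hypothetical, and labels are kept as plain strings because the closed label sets are defined in App\. A rather than here\. The two example rows reuse Bob's beliefs from the False Belief Task example in this appendix\.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BeliefProposition:
    """One labeled belief; field names are illustrative assumptions."""
    holder: str          # actor holding the belief, or "world" for narrated facts
    text: str            # the belief proposition itself
    order: int           # recursive order (0 = narrated fact)
    truth: str           # e.g. "True", "False", "Unknown"
    access: str          # e.g. "Public", "Private", "Shared"
    representation: str  # "Explicit" or "Implicit"
    content_type: str    # compact display label, e.g. "Location"
    source: str          # e.g. "Perception", "Inference"
    context: str         # e.g. "Neutral", "Temporal"

bob_beliefs = [
    BeliefProposition("Bob", "The object is in the box", 1, "False",
                      "Private", "Implicit", "Location", "Perception", "Temporal"),
    BeliefProposition("Bob", "There is a safe in the room", 1, "True",
                      "Public", "Implicit", "Location", "Perception", "Neutral"),
]

def false_beliefs(beliefs):
    """Return the beliefs labeled as false with respect to the story world."""
    return [belief for belief in beliefs if belief.truth == "False"]

outdated = false_beliefs(bob_beliefs)
print([belief.text for belief in outdated])  # ['The object is in the box']
```

Grouping by `holder` then recovers the per\-actor layout used in the worked examples\.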
Table 13: Ambiguous Story Task\. Ambiguous nonverbal coordination around a shared plan\.

Story: Mingfeng, Xiaoyu, and Xiaolin are good friends\. This afternoon, Mingfeng watches Xiaoyu and her newly adopted pet dog play under the shade of the trees in the community\. Suddenly, he winks at Xiaolin, who is next to him, and looks at the nearby pet supply store\. Xiaolin looks at Xiaoyu, then responds to Mingfeng’s gaze\. After that, she stands up and walks towards the pet supply store\.

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Mingfeng, Xiaoyu, and Xiaolin are good friends | 0 | True | Public | Explicit | Identity | Narration | Neutral |
| Mingfeng watches Xiaoyu and her newly adopted pet dog play under the shade of the trees in the community | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Mingfeng winks at Xiaolin | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Mingfeng looks at the nearby pet supply store | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiaolin looks at Xiaoyu | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiaolin responds to Mingfeng’s gaze | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiaolin stands up and walks towards the pet supply store | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Actor Beliefs \(Mingfeng\) | | | | | | | |
| Xiaoyu has a newly adopted pet dog | 1 | True | Public | Implicit | Identity | Perception | Neutral |
| Xiaoyu is playing with her pet dog under the shade of the trees | 1 | True | Public | Implicit | Action | Perception | Neutral |
| The pet supply store is nearby | 1 | True | Private | Implicit | Location | Perception | Neutral |
| Xiaolin is next to Mingfeng | 1 | True | Public | Implicit | Location | Perception | Neutral |
| Mingfeng’s wink will signal Xiaolin about going to the pet supply store | 1 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |
| Xiaolin will understand the wink as a suggestion to go to the pet supply store | 2 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |
| Actor Beliefs \(Xiaolin\) | | | | | | | |
| Mingfeng, Xiaoyu, and Xiaolin are good friends | 1 | True | Private | Implicit | Identity | Memory | Neutral |
| Xiaoyu has a newly adopted pet dog | 1 | True | Public | Implicit | Identity | Perception | Neutral |
| Xiaoyu is playing with her pet dog under the shade of the trees | 1 | True | Public | Implicit | Action | Perception | Neutral |
| The pet supply store is nearby | 1 | True | Shared | Explicit | Location | Perception | Neutral |
| Mingfeng winks at Xiaolin | 1 | True | Shared | Explicit | Action | Perception | Neutral |
| Mingfeng looks at the pet supply store | 1 | True | Shared | Implicit | Action | Perception | Neutral |
| Mingfeng wants to go to the pet supply store to buy pet toys | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Mingfeng is signaling Xiaolin to go to the pet supply store | 1 | True | Private | Implicit | Epistemic | Inference | Neutral |
| Going to the pet supply store will be helpful for Xiaoyu or her dog | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Actor Beliefs \(Xiaoyu\) | | | | | | | |
| Mingfeng, Xiaoyu, and Xiaolin are good friends | 1 | True | Public | Implicit | Identity | Memory | Neutral |
| Mingfeng is watching Xiaoyu and the dog | 1 | True | Private | Implicit | Action | Perception | Neutral |
| Xiaolin is looking at Xiaoyu | 1 | True | Private | Implicit | Action | Perception | Neutral |
| Mingfeng winked at Xiaolin | 1 | True | Private | Implicit | Action | Perception | Neutral |
| Mingfeng looked at the pet supply store | 1 | True | Private | Implicit | Action | Perception | Neutral |
| Xiaolin is walking towards the pet supply store | 1 | True | Private | Implicit | Action | Perception | Neutral |
| Mingfeng and Xiaolin are doing something confusing | 1 | Unknown | Private | Implicit | Action | Perception | Neutral |

Table 14: False Belief Task\. Hidden transfer and outdated belief\.

Story: Alice and Bob are in a room\. There is an object in a box\. Bob leaves\. Alice moves the object to the safe\.

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Alice is in the room | 0 | True | Public | Explicit | Location | Narration | Neutral |
| Bob is in the room | 0 | True | Public | Explicit | Location | Narration | Temporal |
| There is an object in a box | 0 | True | Public | Explicit | Location | Narration | Temporal |
| There is a safe in the room | 0 | True | Public | Explicit | Location | Narration | Neutral |
| Bob leaves the room | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Alice moves the object to the safe | 0 | True | Private | Explicit | Action | Narration | Neutral |
| The object is in the safe | 0 | True | Private | Explicit | Location | Narration | Neutral |
| Actor Beliefs \(Alice\) | | | | | | | |
| Bob is in the room | 1 | True | Private | Implicit | Location | Perception | Temporal |
| There is a box in the room | 1 | True | Public | Implicit | Location | Perception | Neutral |
| There is a safe in the room | 1 | True | Public | Implicit | Location | Perception | Neutral |
| The object is in the box | 1 | True | Public | Implicit | Location | Perception | Temporal |
| Bob left the room | 1 | True | Private | Implicit | Action | Perception | Neutral |
| The object is in the safe | 1 | True | Private | Implicit | Location | Perception | Neutral |
| Bob thinks the object is in the box | 2 | True | Private | Implicit | Epistemic | Inference | Neutral |
| Bob thinks Alice thinks the object is in the box | 3 | True | Private | Implicit | Epistemic | Inference | Neutral |
| Actor Beliefs \(Bob\) | | | | | | | |
| Alice is in the room | 1 | True | Private | Implicit | Location | Perception | Neutral |
| There is a box in the room | 1 | True | Public | Implicit | Location | Perception | Neutral |
| There is a safe in the room | 1 | True | Public | Implicit | Location | Perception | Neutral |
| The object is in the box | 1 | False | Private | Implicit | Location | Perception | Temporal |
| Alice thinks the object is in the box | 2 | False | Private | Implicit | Epistemic | Inference | Temporal |
| Alice thinks Bob thinks the object is in the box | 3 | True | Private | Implicit | Epistemic | Inference | Neutral |

Table 15: Faux\-pas Recognition Test\. Social expectations around a promise to attend a game\.

Story: On Saturday morning, Xiao Wang and Xiao Zhao meet at the school gate\. Xiao Zhao says: "I have a basketball game this afternoon, will you come to watch?" Xiao Wang remembers Xiao Zhao’s game, and replies: "Of course, I will definitely go to support you\." Xiao Zhao happily says: "Great, thank you\!"

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Xiao Wang and Xiao Zhao meet at the school gate on Saturday morning | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiao Zhao says, "I have a basketball game this afternoon, will you come to watch?" | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiao Wang remembers Xiao Zhao’s game | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiao Wang replies, "Of course, I will definitely go to support you\." | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Xiao Zhao says, "Great, thank you\!" | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Actor Beliefs \(Xiao Zhao\) | | | | | | | |
| Xiao Zhao has a basketball game this afternoon | 1 | True | Public | Explicit | Action | Memory | Neutral |
| Xiao Wang might come to watch the basketball game | 1 | True | Private | Implicit | Desire | Inference | Neutral |
| Xiao Wang will come to support Xiao Zhao | 1 | True | Private | Implicit | Desire | Testimony | Neutral |
| Xiao Wang remembers the basketball game | 1 | True | Private | Implicit | Epistemic | Testimony | Neutral |
| Xiao Wang wants to support Xiao Zhao | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Actor Beliefs \(Xiao Wang\) | | | | | | | |
| Xiao Zhao has a basketball game this afternoon | 1 | True | Public | Explicit | Action | Memory | Neutral |
| Xiao Wang will go to support Xiao Zhao | 1 | True | Public | Explicit | Desire | Memory | Neutral |
| Xiao Zhao thinks Xiao Wang remembers the basketball game | 2 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |
| Xiao Zhao thinks Xiao Wang will come to watch | 2 | True | Private | Implicit | Epistemic | Inference | Neutral |

Table 16: Hinting Task Test\. Indirect birthday hint about wanting a dog\.

Story: Rebecca’s birthday is coming soon\. She says to her father, "I like animals, especially dogs\."

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Rebecca’s birthday is coming soon | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Rebecca says to Rebecca’s father, "I like animals, especially dogs\." | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Actor Beliefs \(Rebecca\) | | | | | | | |
| Rebecca’s birthday is coming soon | 1 | True | Private | Implicit | Action | Memory | Neutral |
| A dog would be a good birthday gift | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Rebecca’s father can buy Rebecca a dog | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Rebecca’s father will understand that Rebecca wants a dog as a birthday gift | 1 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |
| Rebecca’s father thinks Rebecca likes dogs | 2 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |
| Actor Beliefs \(Rebecca’s father\) | | | | | | | |
| Rebecca likes animals | 1 | True | Public | Explicit | Trait | Testimony | Neutral |
| Rebecca especially likes dogs | 1 | True | Public | Explicit | Trait | Testimony | Neutral |
| Rebecca implies Rebecca wants a dog as a birthday gift | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Rebecca thinks Rebecca’s father should get Rebecca a dog as a good birthday gift | 2 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |

Table 17: Persuasion Story Task\. A larger\-office goal requiring influence\.

Story: Xiao Hong wants to move to a bigger office, but that office is occupied by her colleague Xiao Li\.

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Xiao Hong wants to move to a bigger office | 0 | True | Public | Explicit | Desire | Narration | Neutral |
| The bigger office is occupied by Xiao Li | 0 | True | Public | Explicit | Location | Narration | Neutral |
| Xiao Hong and Xiao Li are colleagues | 0 | True | Public | Explicit | Identity | Narration | Neutral |
| Actor Beliefs \(Xiao Hong\) | | | | | | | |
| Xiao Hong needs the bigger office that Xiao Li occupies | 1 | True | Private | Implicit | Desire | Inference | Neutral |
| Xiao Hong must persuade Xiao Li to give up the bigger office | 1 | True | Private | Implicit | Desire | Inference | Neutral |
| Xiao Li will agree to exchange offices if Xiao Hong offers convenient conditions | 1 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |

Table 18: Scalar Implicature Test\. Approximate quantity reasoning over white chickens\.

Story: On a farm, Farmer Wang keeps 15 chickens, almost a third of which are white\. He counts some of them and finds that 4 are white\.

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Farmer Wang is on a farm | 0 | True | Public | Explicit | Location | Narration | Neutral |
| Farmer Wang keeps 15 chickens | 0 | True | Public | Explicit | Physical | Narration | Neutral |
| Almost a third of the chickens are white | 0 | True | Public | Explicit | Physical | Narration | Neutral |
| Farmer Wang counts some of the chickens | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Farmer Wang finds that 4 are white | 0 | True | Public | Explicit | Physical | Narration | Neutral |
| Actor Beliefs \(Farmer Wang\) | | | | | | | |
| There are 15 chickens | 1 | True | Public | Implicit | Physical | Perception | Neutral |
| Before counting, almost a third of the chickens are white | 1 | True | Public | Implicit | Physical | Perception | Neutral |
| Before counting, probably 5 chickens are white | 1 | Unknown | Private | Implicit | Physical | Inference | Neutral |
| After counting some of the chickens, probably 5 chickens are white in total | 1 | Unknown | Private | Implicit | Physical | Inference | Neutral |

Table 19: Strange Story Task\. Nonliteral reassurance about coughing\.

Story: Emma coughs\. Throughout lunchtime, she keeps coughing\. Dad says, "Poor Emma, you must have a frog in your throat\!"

| Belief Proposition | Ord | Truth | Access | Repr\. | Type | Source | Context |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Narrated Facts \(world\) | | | | | | | |
| Emma keeps coughing throughout lunchtime | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Dad says, "Poor Emma, you must have a frog in your throat\!" | 0 | True | Public | Explicit | Action | Narration | Neutral |
| Actor Beliefs \(Emma\) | | | | | | | |
| Dad says Emma has a frog in her throat | 1 | True | Private | Explicit | Action | Perception | Neutral |
| Dad thinks Emma does not actually have a frog in her throat | 2 | True | Private | Implicit | Epistemic | Inference | Neutral |
| Dad thinks Emma’s cough sounds like a frog’s call | 2 | True | Private | Implicit | Epistemic | Inference | Neutral |
| Dad wants to make Emma laugh and feel better | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Dad thinks the joke will make Emma laugh and feel better | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Actor Beliefs \(Dad\) | | | | | | | |
| Emma keeps coughing during lunchtime | 1 | True | Private | Implicit | Action | Perception | Neutral |
| Emma does not actually have a frog in her throat | 1 | True | Private | Implicit | Physical | Inference | Neutral |
| Emma’s cough sounds like a frog’s call | 1 | Unknown | Private | Implicit | Physical | Perception | Neutral |
| Saying "Poor Emma, you must have a frog in your throat\!" will make Emma laugh and feel better | 1 | Unknown | Private | Implicit | Desire | Inference | Neutral |
| Emma thinks Dad is trying to make Emma laugh and feel better | 2 | Unknown | Private | Implicit | Epistemic | Inference | Neutral |