MindZero: Learning Online Mental Reasoning With Zero Annotations

arXiv cs.AI 06/02/26, 04:00 AM Papers
Summary
MindZero introduces a self-supervised reinforcement learning framework that trains multimodal large language models for efficient and robust online mental reasoning without requiring mental state annotations, outperforming model-based methods in accuracy and efficiency.
arXiv:2606.00240v1 Announce Type: new
Cached at: 06/02/26, 03:45 PM
# MindZero: Learning Online Mental Reasoning With Zero Annotations
Source: [https://arxiv.org/html/2606.00240](https://arxiv.org/html/2606.00240)
###### Abstract

Effective real\-world assistance requires AI agents with robust Theory of Mind \(ToM\): inferring human mental states from their behavior\. Despite recent advances, several key challenges remain, including \(1\) online inference with robust uncertainty updates over multiple hypotheses; \(2\) efficient reasoning suitable for real\-time assistance; and \(3\) the lack of ground\-truth mental state annotations in real\-world domains\. We address these challenges by introducing MindZero, a self\-supervised reinforcement learning framework that trains multimodal large language models \(MLLMs\) for efficient and robust online mental reasoning\. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model\-based ToM reasoning\. This method thus eliminates the need for explicit mental state annotations\. After training, MindZero internalizes model\-based reasoning into fast single\-pass inference\. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains\. We found that LLMs alone are insufficient; model\-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity\. In contrast, MindZero enhances MLLMs’ intrinsic ToM ability and significantly outperforms model\-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self\-supervised skill\.

Theory of Mind, Reinforcement Learning, Multimodal Large Language Models, Mental Reasoning, AI Assistance

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.00240v1/x1.png)

Figure 1: An example of online mental reasoning for proactive assistance, where the helper agent simultaneously infers the main agent’s goal and helps to reach the goal faster\. As shown in this example, as the helper observes the main agent’s actions over time, MindZero continuously updates a probability distribution over multiple goal hypotheses\. Based on the multiple possible hypotheses maintained at each step, the helper decides whether to act and proactively assists by fetching relevant tableware and placing it on the table\. As new actions are observed, the probabilities of different mental state hypotheses are updated over time\. In particular, the transition from step 2 to step 3 shows that the main agent grabbing a second plate increases the likelihood of the second hypothesis at step 2\.

To proactively assist human users in the real world, AI agents must understand users’ minds and anticipate their needs\. This requires strong Theory of Mind \(ToM\), i\.e\., the ability to infer users’ mental states \(such as desires, beliefs, and goals\) from their behavior\. Recent advances in large language models \(LLMs\) and multimodal LLMs have sparked growing interest in machine Theory of Mind \(Wimmer and Perner, [1983](https://arxiv.org/html/2606.00240#bib.bib63); Ullman, [2023](https://arxiv.org/html/2606.00240#bib.bib51); Wilf et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib54); Sclar et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib42); Jin et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib36)\)\. However, much of the existing work focuses on question\-answering\-based ToM evaluation and development, which is insufficient for real\-world assistance\. In practice, an assistive agent must continuously update its inferences about a user’s mental state and track uncertainty over multiple competing hypotheses\.
This form of online mental\-state reasoning can guide agent planning, enabling proactive assistance, adaptation to changing contexts, and more effective collaboration with users\.

For instance, in Figure [1](https://arxiv.org/html/2606.00240#S1.F1), as the agent observes a human’s actions in a household setting, it maintains and updates a probability distribution over multiple possible goal hypotheses in real time, and uses these hypotheses to decide when and how to proactively help \(e\.g\., fetching tableware before the user asks\)\.

However, training models for online mental reasoning remains challenging\. Human mental states are latent and often ambiguous\. They also change dynamically over time in sequential tasks\. For many real\-world applications, such as household or web assistance, it is extremely difficult and costly to collect large\-scale training data with reliable annotations of ground\-truth mental states\. As a result, prior work on learning\-based ToM methods has been limited to controlled settings \(Rabinowitz et al\., [2018](https://arxiv.org/html/2606.00240#bib.bib41); Rhinehart et al\., [2019](https://arxiv.org/html/2606.00240#bib.bib67); Bortoletto et al\., [2024a](https://arxiv.org/html/2606.00240#bib.bib4), [b](https://arxiv.org/html/2606.00240#bib.bib66)\), lacking open\-endedness and scalability\.

To circumvent these data and annotation challenges, recent work has explored inference\-time reasoning methods that leverage the generality and strong reasoning ability of LLMs for ToM, without requiring model training\. In particular, when integrated with model\-based ToM methods, such as Bayesian inverse planning \(BIP\), inference\-time scaling has demonstrated strong performance on challenging ToM reasoning tasks \(Jin et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib36); Shi et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib37); Zhang et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib35); Ying et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib65); Kim et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib38)\)\. These methods leverage LLMs to propose and evaluate mental state hypotheses, achieving robust and scalable mental reasoning\. However, they are computationally prohibitive for the online mental reasoning required by real\-world assistance tasks\. These challenges call for a new type of ToM approach that retains the deliberative structure of model\-based reasoning while better leveraging the efficiency and learning capacity of LLMs\.

To address these limitations, we introduce MindZero, a novel Theory of Mind reasoning framework that trains multimodal language models to perform robust and efficient online mental reasoning without requiring mental state annotations\. During training, the model explicitly generates hypotheses about mental states \(e\.g\., beliefs and goals\) and is rewarded when these hypotheses assign high likelihood to the actions people actually take\. We term this Self\-Supervised Reinforcement Learning \(SSRL\)\. Unlike common RL\-based language model training, the reward in our SSRL method is computed entirely from self\-supervised signals\. It encourages the model to produce explicit mental state hypotheses with robust uncertainty estimates\. This method eliminates the need for ground\-truth mental state labels, allowing the model to learn directly from behavior and internalize ToM reasoning patterns that explain actions in context\. The trained MindZero model infers mental states in a single forward pass, while remaining grounded in a model\-based objective that preserves robustness and interpretability\.

In our experiments, we compared MindZero against state\-of\-the\-art ToM methods on question answering and proactive assistance tasks in both gridworld \(Jha et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib18)\) and household environments \(Puig et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib15)\)\. Small multimodal language models trained with our MindZero method significantly outperformed baselines in all tasks, matching the robustness of model\-based methods while significantly reducing the computational cost\. We further validate MindZero in an IRB\-approved human study, where it delivers effective real\-time assistance to human users using a small open\-weight backbone\. These results suggest that mental reasoning can be learned as a self\-supervised skill, narrowing the gap between robust but slow model\-based inference and fast but error\-prone reasoning by a small multimodal language model\.

In sum, our main contributions include: \(1\) a self\-supervised RL method, MindZero, that trains multimodal language models to conduct robust and efficient online mental reasoning without mental state annotations; \(2\) systematic evaluation of MindZero and recent ToM methods in a suite of challenging online mental reasoning and proactive AI assistance benchmarks\.

## 2 Related Work

#### Theory of Mind Methods\.

Existing methods for ToM reasoning fall into three main categories\. \(1\) Prompting\-based approaches \(Jung et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib43); Huang et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib55); Yu et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib57); Zhou et al\., [2025a](https://arxiv.org/html/2606.00240#bib.bib56); Hou et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib44); Sclar et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib42)\) improve upon base LLMs but still exhibit systematic errors in long\-context understanding, complex behaviors, and recursive reasoning\. \(2\) Model\-based approaches, especially Bayesian inverse planning \(BIP\) \(Baker et al\., [2009](https://arxiv.org/html/2606.00240#bib.bib45); Ullman et al\., [2009](https://arxiv.org/html/2606.00240#bib.bib46)\), explicitly model agents’ mental states and their influence on behavior\. Recent work integrates BIP with LLMs \(Jin et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib36); Shi et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib37); Zhang et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib35)\), combining structured reasoning with flexible language understanding\. However, these methods are often computationally expensive, as they require searching large hypothesis spaces at test time\. \(3\) Learning\-based methods train neural networks for mental\-state inference \(Rabinowitz et al\., [2018](https://arxiv.org/html/2606.00240#bib.bib41); Liang et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib17); Sclar et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib39); Lu et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib68)\), but they rely on costly and unreliable ground\-truth annotations, limiting their scalability and applicability\. To address these limitations, MindZero learns mental reasoning directly from human behavior data\.
Our approach improves over prompting\-based methods, avoids the computational overhead of model\-based inference, and eliminates the need for explicit mental state annotations required by prior learning\-based approaches\.

#### ToM\-Guided Assistance

Recent work on ToM has mainly focused on question\-answering tasks \(Le et al\., [2019](https://arxiv.org/html/2606.00240#bib.bib58); Gandhi et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib59); Kim et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib60); Wu et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib61); Xu et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib62); Jin et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib36); Shi et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib37); Bortoletto et al\., [2025a](https://arxiv.org/html/2606.00240#bib.bib72); Fan et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib53)\), where ToM models answer questions about mental states based on a story and/or a video\. In contrast, ToM\-guided assistance is more challenging: models must continuously infer and update mental states while accounting for uncertainty over long horizons to support effective assistance\. Prior work has explored Theory of Mind guided assistance \(Puig et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib15); Ying et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib2); Zhi\-Xuan et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib20); Zhou et al\., [2025b](https://arxiv.org/html/2606.00240#bib.bib16); Jin et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib73), [2026](https://arxiv.org/html/2606.00240#bib.bib74)\), where an agent helps a human based on its understanding of the human’s mind across domains such as games, household environments, coding, and real\-world LLM conversations\. Other work studies assistants supporting teams with shared goals \(Seo et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib6); Zhang et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib7)\) or partially divergent goals \(Bortoletto et al\., [2025b](https://arxiv.org/html/2606.00240#bib.bib8)\) through intervention and coordination\.
A further line focuses on situated natural\-language collaboration with rich social dynamics \(Liu et al\., [2012](https://arxiv.org/html/2606.00240#bib.bib9); Chai et al\., [2014](https://arxiv.org/html/2606.00240#bib.bib10); Suhr et al\., [2019](https://arxiv.org/html/2606.00240#bib.bib11); Narayan\-Chen et al\., [2019](https://arxiv.org/html/2606.00240#bib.bib12); Jayannavar et al\., [2020](https://arxiv.org/html/2606.00240#bib.bib13); Bara et al\., [2021](https://arxiv.org/html/2606.00240#bib.bib71); Bortoletto et al\., [2025a](https://arxiv.org/html/2606.00240#bib.bib72)\)\. Although prior work on online mental reasoning has been shown to be effective in ToM\-guided assistance \(e\.g\., Puig et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib15); Wang et al\., [2021](https://arxiv.org/html/2606.00240#bib.bib21); Shvo et al\., [2022](https://arxiv.org/html/2606.00240#bib.bib22); Zhi\-Xuan et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib20); Ying et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib2); Cross et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib1); Ma et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib3)\), these methods make strong assumptions about human behavior and/or incur high computational costs for complex tasks\. MindZero directly targets this gap by training a small multimodal language model to efficiently and robustly conduct online mental reasoning that can support downstream assistance tasks in a scalable way\.

![Refer to caption](https://arxiv.org/html/2606.00240v1/x2.png) \(a\) Self\-Supervised Reinforcement Learning\.
![Refer to caption](https://arxiv.org/html/2606.00240v1/x3.png) \(b\) Reward Computation\.

Figure 2: \(a\) Overview of our Self\-Supervised Reinforcement Learning \(SSRL\) framework\. Given states $s_{1:t}$ and actions $a_{1:t}$ up to timestep $t$, the model outputs a set of $N$ mental state hypotheses $m_t^{1:N}$ along with their probabilities $q_t^{1:N}$\. Unlike standard RL\-based language model training, SSRL derives rewards entirely from self\-supervised signals based on observations and model outputs, which are used to guide GRPO updates\. \(b\) Reward computation in SSRL\. Given the model outputs, an action likelihood evaluator \(either an LLM or a model\-based planner\) estimates the likelihood of the observed action under each mental state hypothesis, and mental priors are either estimated by an LLM as the likelihood of the proposed hypotheses or set uniformly\. The reward is computed as the probability\-weighted log\-likelihood of the observed action and mental state hypotheses plus an entropy bonus \(the $-\sum_i q_t^{(i)} \log q_t^{(i)}$ term in Equation \(4\)\)\.

## 3 Problem Formulation

We formalize the problem of online mental state inference \(Section [3\.1](https://arxiv.org/html/2606.00240#S3.SS1)\) and characterize how inferred mental states can be leveraged to enable proactive assistance \(Section [3\.2](https://arxiv.org/html/2606.00240#S3.SS2)\)\. Our formulation provides a unified probabilistic framework for reasoning about users’ latent beliefs and goals from sequential observations, and for translating this uncertainty\-aware reasoning into effective assistive decision making in dynamic environments\.

### 3\.1 Online Mental Reasoning

Given a sequence of observed user behavior up to time step $t$, including states $s_{1:t}$ and actions $a_{1:t}$, a ToM model infers the latest mental state of the user $m_t$, which could include different mental variables such as beliefs $b_t$ and goals $g_t$\. Inspired by Bayesian inverse planning \(BIP\) \(Baker et al\., [2009](https://arxiv.org/html/2606.00240#bib.bib45), [2017](https://arxiv.org/html/2606.00240#bib.bib47); Zhi\-Xuan et al\., [2020](https://arxiv.org/html/2606.00240#bib.bib48)\), a model\-based ToM inference method, we formalize online mental state inference as the following Bayesian inference:

$$\underbrace{P(m_t \mid s_{1:t}, a_{1:t})}_{\text{posterior}} \propto \underbrace{P(a_{1:t} \mid m_t, s_{1:t})}_{\text{action likelihood}} \cdot \underbrace{P(m_t)}_{\text{prior}}, \tag{1}$$
Unlike prior work by Zhi\-Xuan et al\. \([2020](https://arxiv.org/html/2606.00240#bib.bib48)\), this formulation goes beyond the typical Markovian assumptions behind BIP, modeling all past behavior jointly\. In real\-world domains, this Bayesian inference can be computationally intractable due to an infinite hypothesis space and costly action likelihood estimation \(which is achieved via forward planning conditioned on hypothetical mental states\)\. Our MindZero method aims to overcome these computational bottlenecks by training a multimodal language model to directly output high\-quality hypothesis samples and their posterior probabilities without explicit Bayesian inference\.
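
As a minimal sketch of Equation (1) restricted to a small, finite hypothesis set, the posterior can be obtained by adding log action likelihoods to log priors and normalizing. The function name `posterior` and all numbers below are illustrative assumptions, not values from the paper; in practice each likelihood would come from a planner or an LLM rather than being given.

```python
import math

def posterior(log_likelihoods, log_priors):
    """Normalize log P(a | m, s) + log P(m) over a finite hypothesis set."""
    joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    peak = max(joint)  # subtract the max before exponentiating, for stability
    weights = [math.exp(j - peak) for j in joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: three goal hypotheses under a uniform prior.
log_lik = [math.log(0.6), math.log(0.3), math.log(0.1)]
log_prior = [math.log(1 / 3)] * 3
post = posterior(log_lik, log_prior)
```

With a uniform prior the posterior simply renormalizes the likelihoods. The cost of exact inference comes from producing each likelihood, which requires forward planning per hypothesis, and from the hypothesis set itself being unbounded in realistic domains.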

### 3\.2 Proactive Assistance Guided by Online Mental Reasoning

In online mental reasoning, the model must continuously update multiple mental state hypotheses $\{m_t\}$ at every step $t$ and estimate their probabilities $\{q_t\}$ given a user’s behavior history $(s_{1:t}, a_{1:t})$\. Given the top hypotheses of a user’s mental state, an assistive agent can then plan for the assistive actions to best help the user\. Let $a^{A}_{t}$ be the assistive action at time step $t$\. We define the assistive agent’s policy as

$$P(a^{A}_{t} \mid s_{1:t}, a_{1:t}) = \sum_{m_t} P(a^{A}_{t} \mid s_t, m_t)\, P(m_t \mid s_{1:t}, a_{1:t}). \tag{2}$$
Such assistive decision making must consider the uncertainty in the mental inference, which requires a robust estimate of the confidence of multiple hypotheses\. It also needs to frequently update plans based on the most recent user behavior, and thus needs fast inference to support real\-time replanning\. MindZero aims to achieve this by training a small multimodal language model with low computational cost and latency\.
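
The marginalization in Equation (2) can be sketched as a posterior-weighted mixture of per-hypothesis helper policies. This is only an illustration under stated assumptions: the action names, the probabilities, and the helper `assistive_policy` are hypothetical, and the experiments rely on existing helping planners rather than this toy lookup.

```python
def assistive_policy(q, policy_per_hypothesis):
    """Mix per-hypothesis helper policies P(a^A | s_t, m_t) by posterior weight q."""
    mixed = {}
    for weight, policy in zip(q, policy_per_hypothesis):
        for action, prob in policy.items():
            mixed[action] = mixed.get(action, 0.0) + weight * prob
    return mixed

# Hypothetical posterior over two goal hypotheses and the helper policy under each.
q = [0.7, 0.3]
per_goal = [
    {"fetch_plate": 0.9, "wait": 0.1},
    {"fetch_fork": 0.8, "wait": 0.2},
]
policy = assistive_policy(q, per_goal)
```

Because every hypothesis contributes in proportion to its probability, a poorly calibrated posterior directly distorts the helper's action distribution, which is why robust uncertainty estimates matter for assistance.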

## 4 MindZero

We introduce MindZero, a self\-supervised reinforcement learning framework that trains multimodal language models to perform efficient and robust online mental reasoning\. MindZero learns directly from behavioral data using self\-supervised signals, addressing the lack of ground\-truth mental state labels in real\-world domains \(Section [4\.1](https://arxiv.org/html/2606.00240#S4.SS1) and Figure [2\(a\)](https://arxiv.org/html/2606.00240#S2.F2.sf1)\)\. The core of MindZero is its reward design: the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions, as estimated by a model\-based planner or an LLM, in a manner similar to model\-based ToM reasoning \(Section [4\.2](https://arxiv.org/html/2606.00240#S4.SS2) and Figure [2\(b\)](https://arxiv.org/html/2606.00240#S2.F2.sf2)\)\. Through this process, MindZero internalizes the Bayesian inverse planning procedure in Equation \([1](https://arxiv.org/html/2606.00240#S3.E1)\) and enables real\-time planning for proactive assistance as in Equation \([2](https://arxiv.org/html/2606.00240#S3.E2)\)\.

### 4\.1 Self\-Supervised RL for Mental Reasoning

Standard supervised approaches to mental reasoning rely on ground\-truth mental state annotations, which are scarce and difficult to collect\. Existing self\-supervised methods for sequential modeling, such as next\-token prediction \(Bengio et al\., [2003](https://arxiv.org/html/2606.00240#bib.bib28); Radford et al\., [2018](https://arxiv.org/html/2606.00240#bib.bib29)\) and autoregressive trajectory modeling \(Chen et al\., [2021](https://arxiv.org/html/2606.00240#bib.bib30)\), emphasize forward prediction and learn by mimicking future words or actions from past context\. In contrast, mental reasoning requires inverse modeling: explicitly inferring the mental state that causes the observed behavior\. This capability is not explicitly learned by existing self\-supervised objectives, which are optimized for prediction rather than explanation\.

To bridge this gap, we formulate mental reasoning as a self\-supervised reinforcement learning \(SSRL\) problem centered on explanatory consistency\. Instead of treating actions as prediction targets, we view them as evidence\. In MindZero, the model is rewarded not for predicting actions directly, but for generating mental state hypotheses that maximize the likelihood of user actions, thereby providing coherent explanations of agent behavior\. As illustrated in Figure [2\(a\)](https://arxiv.org/html/2606.00240#S2.F2.sf1), unlike common RL\-based language model training, the reward in our SSRL method is entirely calculated via self\-supervised signals from user behavior \(without ground\-truth mental state annotations\) and model outputs\. Based on this reward, we then use GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib24); Guo et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib25)\) to train the model, closing the self\-supervised learning loop\.
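
For intuition about how a scalar reward drives the update, GRPO scores each sampled output relative to the other outputs drawn for the same input. The sketch below shows only that group-relative normalization step with hypothetical reward values; it omits the clipped policy-gradient update and the KL penalty, and the use of a population standard deviation is an assumption made here for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of outputs sampled for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against zero spread
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical self-supervised rewards for four sampled hypothesis sets.
rewards = [-1.2, -0.4, -2.0, -0.8]
advantages = group_relative_advantages(rewards)
```

Outputs whose hypotheses explain the observed actions better than their siblings receive positive advantages, so no learned value function and no mental state labels are needed.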

### 4\.2 Reward Design

Formally, given a sequence of user behavior $(s_{1:t}, a_{1:t})$, we optimize a multimodal language model $Q_\theta$ to approximate the posterior of mental states $m_t$ via variational inference \(Bishop, [2006](https://arxiv.org/html/2606.00240#bib.bib34)\)\. As traversing the full hypothesis space is intractable, we maximize the Evidence Lower Bound \(ELBO\) \(Kingma and Welling, [2014](https://arxiv.org/html/2606.00240#bib.bib31)\)\. The optimization objective can be formalized as the following reward function:

$$\mathcal{J}(\theta) = \mathbb{E}_{Q_\theta}\left[\log\left(P(a_{1:t} \mid m_t, s_{1:t}) \cdot P(m_t)\right)\right] + H(Q_\theta), \tag{3}$$
where the $P$ terms denote estimators of the action likelihood and mental state prior in Equation \([1](https://arxiv.org/html/2606.00240#S3.E1)\); and $H(Q_\theta)$ is the entropy of $Q_\theta$\. In particular, the entropy term encourages exploration over mental state hypotheses and prevents premature collapse to a single mode, thereby promoting robust and diverse posterior approximations\.
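
The lower-bound property behind Equation (3) follows from a standard variational identity. As a short derivation, write $Q_\theta(m_t)$ for $Q_\theta(m_t \mid s_{1:t}, a_{1:t})$ and assume, as in Equation (1), that the prior does not depend on the states and that the $P$ terms are exact rather than estimated:

$$
\begin{aligned}
\log P(a_{1:t} \mid s_{1:t})
&= \mathbb{E}_{Q_\theta}\left[\log \frac{P(a_{1:t} \mid m_t, s_{1:t})\, P(m_t)}{Q_\theta(m_t)}\right] + \mathrm{KL}\left(Q_\theta(m_t) \,\|\, P(m_t \mid s_{1:t}, a_{1:t})\right) \\
&= \mathcal{J}(\theta) + \mathrm{KL}\left(Q_\theta(m_t) \,\|\, P(m_t \mid s_{1:t}, a_{1:t})\right) \\
&\geq \mathcal{J}(\theta).
\end{aligned}
$$

The second line uses $-\mathbb{E}_{Q_\theta}[\log Q_\theta(m_t)] = H(Q_\theta)$. Since the KL term is nonnegative and vanishes only when $Q_\theta$ equals the posterior, maximizing $\mathcal{J}(\theta)$ under these idealized assumptions pushes the model's hypothesis distribution toward the Bayesian posterior of Equation (1).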

In practice, the model $Q_\theta(\cdot \mid s_{1:t}, a_{1:t})$ generates a finite set of $N$ mental state hypotheses $\mathcal{M}_t = \{m^{(1)}_t, \dots, m^{(N)}_t\}$, along with their normalized posterior probabilities $\mathcal{Q}_t = \{q^{(1)}_t, \dots, q^{(N)}_t\}$ such that $\sum_{i=1}^{N} q^{(i)} = 1$\. We treat these $N$ candidates as the effective support of the variational posterior\. Consequently, the likelihood, prior, and entropy terms in Equation \([3](https://arxiv.org/html/2606.00240#S4.E3)\) are computed as weighted sums:

$$\begin{split} R(\mathcal{M}_t, \mathcal{Q}_t) = &\sum_{i=1}^{N} q_t^{(i)} \left[\log\left(P(a_{1:t} \mid m^{(i)}_t, s_{1:t}) \cdot P(m_t^{(i)})\right)\right] \\ &- \sum_{i=1}^{N} q_t^{(i)} \log q_t^{(i)}. \end{split} \tag{4}$$
Action Likelihood\. Action likelihood measures how probable the observed actions are under a given mental state hypothesis\. Specifically, $P_t^{(i)} = P(a_{1:t} \mid m^{(i)}_t, s_{1:t})$ computes the likelihood of the action sequence up to time $t$, given the observed states $s_{1:t}$ and a proposed mental state hypothesis $m^{(i)}_t$\. This likelihood can be estimated using either a model\-based planner \(as in the GridWorld domain in Sections [5\.1](https://arxiv.org/html/2606.00240#S5.SS1) and [5\.2](https://arxiv.org/html/2606.00240#S5.SS2)\) or an LLM \(as in the Household domain in Sections [5\.3](https://arxiv.org/html/2606.00240#S5.SS3) and [5\.4](https://arxiv.org/html/2606.00240#S5.SS4)\)\.
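
To make the planner-based option concrete, here is one common way such a likelihood is estimated in inverse-planning work: a Boltzmann-rational agent that prefers moves reducing its distance to the hypothesized goal. This is an illustrative assumption, not the planner used in the paper. The move set, the Manhattan-distance cost (which ignores obstacles), the rationality parameter `beta`, and the per-step factorization are all hypothetical simplifications.

```python
import math

# Four grid moves as (dx, dy) offsets; the move set is an assumption.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step_log_likelihood(state, action, goal, beta=2.0):
    """log P(a | s, goal) under a softmax over negative distance-to-goal after each move."""
    logits = {}
    for move, (dx, dy) in MOVES.items():
        nx, ny = state[0] + dx, state[1] + dy
        logits[move] = -beta * (abs(nx - goal[0]) + abs(ny - goal[1]))
    peak = max(logits.values())
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
    return logits[action] - log_norm

def trajectory_log_likelihood(states, actions, goal, beta=2.0):
    """Sum per-step log-likelihoods (a Markov simplification of P(a_{1:t} | m, s_{1:t}))."""
    return sum(step_log_likelihood(s, a, goal, beta) for s, a in zip(states, actions))

# Hypothetical trajectory: three steps to the right, scored under two candidate goals.
states = [(0, 0), (1, 0), (2, 0)]
actions = ["right", "right", "right"]
ll_east = trajectory_log_likelihood(states, actions, goal=(5, 0))
ll_south = trajectory_log_likelihood(states, actions, goal=(0, 5))
```

Moving right is far more probable under the eastern goal than the southern one, so a hypothesis naming the eastern goal would earn a higher likelihood term in the reward.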

Mental State Prior\. Mental state prior $P(m_t)$ represents the prior probabilities assigned to different mental state hypotheses $m_t$\. These priors can be either uniform or non\-uniform to incorporate prior knowledge from symbolic rules or LLMs, helping constrain the hypothesis space\. For example, in a household environment, goals such as placing food into a dishwasher or setting the table with vastly mismatched numbers of plates and cutlery would be assigned a low prior probability\. This effectively prevents the model from generating hypotheses that violate common sense at the proposal stage\.

In summary, to produce hypotheses with high action likelihoods, high mental state priors, and consequently high rewards, the proposed mental states must be explicit and meaningful to the estimators of both the action likelihood and the mental state prior\. This in turn encourages the model to learn to propose explicit and meaningful mental states through RL training\. Meanwhile, the entropy bonus keeps the hypothesis distribution diverse and robust\. As a result, the model can learn to conduct explicit online mental reasoning without the need for ground\-truth mental state annotations\.
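
The reward in Equation (4) is straightforward to compute once the per-hypothesis log-likelihoods and log-priors are available. The sketch below is a minimal illustration with hypothetical inputs; the function name `ssrl_reward` and the numbers are assumptions, and in training the likelihood and prior values would be produced by the planner or LLM estimators described above.

```python
import math

def ssrl_reward(q, log_likelihoods, log_priors):
    """Equation (4): expected log joint under q plus the entropy of q."""
    expected_log_joint = sum(
        qi * (ll + lp) for qi, ll, lp in zip(q, log_likelihoods, log_priors)
    )
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0.0)  # 0 * log 0 treated as 0
    return expected_log_joint + entropy

# Hypothetical estimator outputs for three hypotheses, with a uniform prior.
log_lik = [math.log(0.6), math.log(0.3), math.log(0.1)]
log_prior = [math.log(1 / 3)] * 3

r_calibrated = ssrl_reward([0.6, 0.3, 0.1], log_lik, log_prior)  # q equals the posterior
r_uniform = ssrl_reward([1 / 3, 1 / 3, 1 / 3], log_lik, log_prior)
r_collapsed = ssrl_reward([1.0, 0.0, 0.0], log_lik, log_prior)
```

In this toy case the calibrated probabilities score highest, and their reward equals the log evidence $\log \tfrac{1}{3}$. Both the overly flat and the collapsed distributions score lower, which illustrates how the objective rewards well-calibrated uncertainty rather than confidence alone.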

![Refer to caption](https://arxiv.org/html/2606.00240v1/x4.png)

Figure 3: Our experimental settings for mental state reasoning and proactive assistance: \(1\) GridWorld Question Answering \(Section [5\.1](https://arxiv.org/html/2606.00240#S5.SS1)\); \(2\) GridWorld Proactive Assistance \(Section [5\.2](https://arxiv.org/html/2606.00240#S5.SS2)\); \(3\) Household Question Answering \(Section [5\.3](https://arxiv.org/html/2606.00240#S5.SS3)\); and \(4\) Household Proactive Assistance \(Section [5\.4](https://arxiv.org/html/2606.00240#S5.SS4)\)\.

## 5 Experimental Setup

As shown in Figure [3](https://arxiv.org/html/2606.00240#S4.F3), we systematically evaluate MindZero and baseline methods across four experimental settings: \(1\) GridWorld Question Answering \(Section [5\.1](https://arxiv.org/html/2606.00240#S5.SS1)\), \(2\) GridWorld Proactive Assistance \(Section [5\.2](https://arxiv.org/html/2606.00240#S5.SS2)\), \(3\) Household Question Answering \(Section [5\.3](https://arxiv.org/html/2606.00240#S5.SS3)\), and \(4\) Household Proactive Assistance \(Section [5\.4](https://arxiv.org/html/2606.00240#S5.SS4)\)\. The question answering settings focus on directly answering ToM\-related questions about humans’ mental states, whereas the assistance settings require fast, online mental reasoning about human behavior to provide proactive and accurate support\. We list the evaluated models and baselines in Section [5\.5](https://arxiv.org/html/2606.00240#S5.SS5)\.

### 5\.1 GridWorld Question Answering

We adapt the Construction environment \(Jha et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib18)\), a 2D grid world where agents navigate around obstacles \(e\.g\., walls\) and carry colored objects to different locations\. Here, a human agent aims to assemble two blocks of specific colors by picking up one and moving it toward the other\. The model must infer the human’s intended goal, specifically which two colored blocks the human intends to assemble, given a partial trajectory of diverse human action patterns\. Beyond mental\-state reasoning, the task also requires visual grounding: the model must map the question and trajectory to the correct colored blocks in the scene\. This goes beyond prior ToM QA benchmarks, which are largely story\-based and do not require vision\-language grounding\.

When training MindZero in the GridWorld domain, we assume a uniform mental state prior in the reward defined in Equation \([4](https://arxiv.org/html/2606.00240#S4.E4)\) and use a model\-based planner to estimate action likelihoods\.

### 5\.2 GridWorld Proactive Assistance

Using the same Construction environment as in Section [5\.1](https://arxiv.org/html/2606.00240#S5.SS1), we define a proactive assistance task in which a human agent aims to assemble two blocks of specific colors, while a helper agent must continuously observe the human’s actions, infer the intended goal, and assist in completing it more efficiently\. We evaluate helping performance using speedup, which measures how much the helper accelerates the human’s task completion; metric details are provided in Appendix [A\.2](https://arxiv.org/html/2606.00240#A1.SS2)\. Implementation and data generation details are provided in Appendix [B](https://arxiv.org/html/2606.00240#A2)\.

The proactive assistance setting introduces several challenges beyond story\-based evaluation: \(1\) reasoning must occur at every timestep, rather than at a single queried moment; \(2\) the model must generate diverse yet plausible hypotheses from scratch, rather than selecting from provided choices; and \(3\) the assistant must perform online goal inference under ambiguity, identifying the user’s goal early enough to provide timely help, but not so early that it commits to an incorrect hypothesis\. Delayed inference limits effective assistance, while premature and incorrect inference can incur large penalties when the assistant helps toward the wrong goal and later revises its belief\.

### 5\.3 Household Question Answering

We evaluate household question answering using MMToM\-QA \(Jin et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib36)\), a multimodal benchmark that includes questions covering the beliefs and goals of a person searching for an object \(e\.g\., a remote controller\) in a household environment\. The task is challenging because it requires joint inference of both beliefs and goals with both visual and textual inputs\.

For the household domain, we adopt the information fusion methods proposed by Jin et al\. \([2024](https://arxiv.org/html/2606.00240#bib.bib36)\) and Shi et al\. \([2025](https://arxiv.org/html/2606.00240#bib.bib37)\) to combine visual and textual inputs, resulting in fused representations in text form\. All methods receive the same fused information as input\. When training MindZero, we use the same pretrained LLM to estimate both the prior and action likelihood terms in the reward defined in Equation \([4](https://arxiv.org/html/2606.00240#S4.E4)\)\. For the prior term, the LLM directly outputs log prior probabilities by judging whether a goal is plausible in the context of a household task\. This incorporates commonsense knowledge from the pretrained LLM and helps constrain the goal space\. Training data generation details are provided in Appendix [C](https://arxiv.org/html/2606.00240#A3)\.

### 5\.4 Household Proactive Assistance

We evaluate household assistance using the embodied benchmark Online Watch\-And\-Help \(O\-WAH\) \(Puig et al\., [2023](https://arxiv.org/html/2606.00240#bib.bib15)\), where a helper agent observes a human’s actions, infers the intended goal, and assists in completing it more efficiently in realistic household environments\. In this task, the helper agent must update its goal inference based on the latest observations in an online manner\. At each step, we use the uncertainty\-aware helping planner proposed in Puig et al\. \([2023](https://arxiv.org/html/2606.00240#bib.bib15)\) to generate assistance actions based on the inferred goals\. To evaluate generalization, we use different apartments for training and testing\. To reduce variance, the results are reported as the average over 3 runs per episode\. We include experiment details in Appendix [C](https://arxiv.org/html/2606.00240#A3)\.

Besides the challenges of proactive assistance described in Section [5\.2](https://arxiv.org/html/2606.00240#S5.SS2), the Household setting introduces additional difficulties: \(1\) a much larger state, action, and goal space \(e\.g\., uncertainty over which objects are needed, how many are required, and their target locations\); \(2\) partial observability, whereas GridWorld is fully observable; and \(3\) significantly longer episode horizons\.

### 5\.5 Models and Baselines

We compare MindZero against the following baselines:

- Base models: For the GridWorld domain \(Sections [5\.1](https://arxiv.org/html/2606.00240#S5.SS1)–[5\.2](https://arxiv.org/html/2606.00240#S5.SS2)\), we use the open\-weight multimodal models Qwen3\-VL\-4B and Qwen3\-VL\-8B \(Yang et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib32)\)\. For the Household domain \(Sections [5\.3](https://arxiv.org/html/2606.00240#S5.SS3)–[5\.4](https://arxiv.org/html/2606.00240#S5.SS4)\), we use the open\-weight language models Llama\-3\.1\-8B, Llama\-3\.2\-3B \(Dubey et al\., [2024](https://arxiv.org/html/2606.00240#bib.bib33)\), and Qwen3\-4B \(Yang et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib32)\), using fused textual inputs\.
- Large models: Additionally, we evaluate Qwen3\-235B\-A22B, GPT\-5\.2, and Gemini\-3 zero\-shot to gauge the performance of large models\. For question answering, we report results with both the thinking and non\-thinking versions of the models\. For proactive assistance, we report only the non\-thinking results, as this setting requires models to make decisions in real time\.
- Test\-time scaling methods: We evaluate ThoughtTracing \(Kim et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib38)\), a test\-time reasoning approach for mental\-state tracking that maintains and updates multiple hypotheses, and AutoToM \(Zhang et al\., [2025](https://arxiv.org/html/2606.00240#bib.bib35)\), a model\-based method for automated agent modeling\. Both are instantiated with the open\-source base models listed above\. We do not evaluate them in the Proactive Assistance domains due to their slow inference speed, which limits real\-time applicability\. As they do not support visual inputs, we provide textual transcripts of GridWorld observations\. We describe implementation details in Appendix [E](https://arxiv.org/html/2606.00240#A5)\.

![Refer to caption](https://arxiv.org/html/2606.00240v1/x5.png) \(a\) GridWorld Question Answering
![Refer to caption](https://arxiv.org/html/2606.00240v1/x6.png) \(b\) Household Question Answering

Figure 4: Question answering results of MindZero and baselines on \(a\) GridWorld and \(b\) Household domains\. MindZero achieves a 1\.7–2\.5× accuracy gain \(solid bars\) across different base models with negligible additional inference cost \(hatched bars\), and consistently outperforms all test\-time scaling baselines in both accuracy and efficiency\. Full results are shown in Table [4](https://arxiv.org/html/2606.00240#A6.T4)\.

For a fair comparison, we evaluate MindZero using the same open\-source base models described above\.

## 6 Experimental Results

Table 1: Proactive assistance results of MindZero, base models, and large models on \(a\) GridWorld and \(b\) Household domains\. Best results are shown in bold\. \* indicates models that cannot generate goal hypotheses in the correct format at all and must be fine\-tuned to follow the output format before RL training\.

\(a\) GridWorld Proactive Assistance

| Method | Speedup ↑ | TFLOPs ↓ |
| --- | --- | --- |
| Random Goal | 0\.0 | N/A |
| Base Models | | |
| Qwen3\-VL\-4B | 1\.4 | 151\.7 |
| Qwen3\-VL\-8B | \-0\.1 | 295\.2 |
| Large Models | | |
| Qwen3\-VL\-235B\-A22B | 1\.0 | 808\.6 |
| GPT\-5\.2 | 0\.0 | Proprietary |
| Gemini\-3\-Flash | 0\.0 | Proprietary |
| MindZero \(Ours\) | | |
| w/ Qwen3\-VL\-4B | 23\.0 | 161\.4 |
| w/ Qwen3\-VL\-8B | 24\.5 | 291\.8 |

\(b\) Household Proactive Assistance

| Method | Speedup ↑ | TFLOPs ↓ |
| --- | --- | --- |
| Random Goal | \-2\.2 | N/A |
| Base Models | | |
| Llama\-3\.2\-3B\* | 2\.3 | 244\.3 |
| Llama\-3\.1\-8B | 1\.7 | 656\.1 |
| Qwen3\-4B | 2\.3 | 213\.1 |
| Large Models | | |
| Qwen3\-235B\-A22B | 12\.3 | 1101\.6 |
| GPT\-5\.2 | 9\.4 | Proprietary |
| Gemini\-3\-Flash | 17\.7 | Proprietary |
| MindZero \(Ours\) | | |
| w/ Llama\-3\.2\-3B\* | 4\.3 | 235\.1 |
| w/ Llama\-3\.1\-8B | 17\.4 | 608\.4 |
| w/ Qwen3\-4B | 19\.1 | 201\.2 |

### 6\.1 Overall Results

#### Question Answering

As shown in Figure [4](https://arxiv.org/html/2606.00240#S5.F4), MindZero consistently outperforms pretrained and test\-time scaling baselines in both GridWorld QA \(Figure [4\(a\)](https://arxiv.org/html/2606.00240#S5.F4.sf1)\) and Household QA \(Figure [4\(b\)](https://arxiv.org/html/2606.00240#S5.F4.sf2)\), while maintaining low inference cost\.

In GridWorld QA, MindZero achieves the best accuracy among all methods with both Qwen3\-VL\-4B and Qwen3\-VL\-8B, substantially improving over their base models and delivering a 2\.1–2\.5× accuracy gain\.

In Household QA, MindZero likewise achieves strong performance across all base models, with MindZero w/ Llama\-3\.2\-3B attaining the highest accuracy among open\-weight and test\-time scaling methods and remaining competitive with the best proprietary systems despite minimal inference cost\. Compared with ThoughtTracing and AutoToM, which require substantially more test\-time computation, MindZero delivers a clearly better accuracy\-efficiency trade\-off, even when those methods use much larger backend models\.

#### Proactive Assistance

As shown in Table [1](https://arxiv.org/html/2606.00240#S6.T1), MindZero achieves the best performance among all methods and yields substantial gains over its base models in task completion speed in both GridWorld Proactive Assistance \(Table [1\(a\)](https://arxiv.org/html/2606.00240#S6.T1.st1)\) and Household Proactive Assistance \(Table [1\(b\)](https://arxiv.org/html/2606.00240#S6.T1.st2)\), where the base models provide little to no speedup\.

In GridWorld Proactive Assistance, MindZero achieves 23\.0% and 24\.5% speedup with Qwen3\-VL\-4B and Qwen3\-VL\-8B, respectively\. In contrast, GPT\-5\.2 and Gemini\-3\-Flash yield no speedup, as their goal predictions change constantly, causing the agent’s actions to become unstable \(i\.e\., frequently changing directions\)\. As a result, the agent fails to pick up an object before the task ends\.

In Household Proactive Assistance, MindZero with Qwen3\-4B achieves the best speedup of 19\.1%, higher than the strongest baseline \(17\.7% for Gemini\-3\-Flash\) and at the lowest inference cost among the reported models\. A notable exception is MindZero with Llama\-3\.2\-3B, which does not show a significant gain over its base model\. Because this model cannot produce goal hypotheses in the required format, we first fine\-tune it on generations sampled from the pretrained Llama\-3\.1\-8B before RL training, avoiding any reliance on ground\-truth or pseudo labels\. However, while this warm\-up teaches the correct format, the relatively low quality of the sampled generations appears to be memorized as well, introducing a bias that ultimately suppresses the expected improvement\.

![Refer to caption](https://arxiv.org/html/2606.00240v1/x7.png) \(a\) GridWorld Proactive Assistance
![Refer to caption](https://arxiv.org/html/2606.00240v1/x8.png) \(b\) Household Proactive Assistance

Figure 5: Goal accuracy or F1 score for online goal inference versus task progress across \(a\) GridWorld and \(b\) Household proactive assistance\. The accuracy of MindZero’s predicted goal \(bold solid curves\) steadily improves over time and reaches a strong level, while most baselines \(dashed curves\) remain much lower or improve more slowly\.

### 6\.2 Online Goal Inference Dynamics

Figure [5](https://arxiv.org/html/2606.00240#S6.F5) shows the accuracy of online goal inference as task progress increases in both GridWorld and Household Proactive Assistance\. In both settings, MindZero steadily improves its goal prediction over time, indicating that it can effectively accumulate evidence from ongoing interaction and refine its belief about the user’s objective\. In GridWorld \(Figure [5\(a\)](https://arxiv.org/html/2606.00240#S6.F5.sf1)\), MindZero is the only method whose accuracy rises substantially as the task unfolds, eventually reaching a strong level\. In contrast, all baselines remain very low for most of the trajectory and only increase in accuracy near the end, making effective assistance difficult\. In Household \(Figure [5\(b\)](https://arxiv.org/html/2606.00240#S6.F5.sf2)\), MindZero again achieves the strongest performance, with prediction accuracy increasing consistently, significantly outperforming base models and matching much larger pretrained models\. These results suggest that accurate and stable online goal inference is a key reason why MindZero can deliver effective proactive assistance\.
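How task progress is discretized for this analysis is not stated in this section\. The following is a minimal sketch of one way such an accuracy\-versus\-progress curve could be computed, assuming each episode is logged as a list of per\-step correctness flags and that progress is the step index divided by the episode length\. The function name, the data layout, and the binning rule are all hypothetical\.

```python
def accuracy_by_progress(episodes, num_bins=5):
    """Average per-step goal accuracy inside equal-width task-progress bins.

    episodes: list of episodes, each a list of booleans that say whether the
    predicted goal at that step matched the true goal (assumed log format).
    Returns one accuracy per bin, or None for a bin that received no steps.
    """
    hits = [0] * num_bins
    counts = [0] * num_bins
    for correct_flags in episodes:
        n = len(correct_flags)
        for t, ok in enumerate(correct_flags):
            # Step t of n has progress t / n in [0, 1); map it to a bin index.
            b = min(t * num_bins // n, num_bins - 1)
            hits[b] += int(ok)
            counts[b] += 1
    return [h / c if c else None for h, c in zip(hits, counts)]


if __name__ == "__main__":
    # Two hypothetical episodes of different lengths.
    logs = [
        [False, False, True, True],
        [False, True, True, True, True, True],
    ]
    print(accuracy_by_progress(logs, num_bins=2))
```

Normalizing by episode length lets episodes of different horizons share one progress axis, which matters in the Household domain where horizons are much longer than in GridWorld\.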

### 6\.3 Ablation Study

To understand the key components driving MindZero’s performance, we conduct comprehensive ablation studies on Qwen3\-4B, as shown in Table [2](https://arxiv.org/html/2606.00240#S6.T2)\. We examine three critical design choices: prior modeling, multiple hypotheses, and entropy bonus\. All experiments use the same training configuration as our main experiments\.

Table 2: Ablation on Household Proactive Assistance using Qwen3\-4B\.

| \# | Method | Speedup ↑ | TFLOPs ↓ |
| --- | --- | --- | --- |
| I | MindZero | 19\.1 | 201\.2 |
| II | w/o prior modeling | 17\.0 | 200\.5 |
| III | w/o multiple hypotheses | 10\.3 | 132\.6 |
| IV | w/o entropy bonus | 5\.2 | 245\.1 |

#### Explicit Prior Modeling

In the household environment, humans are assumed to pursue a set of predefined goal types, such as setting up the dinner table or putting dishes in the dishwasher\. We explicitly require an LLM to check whether each goal hypothesis is reasonable\. For example, putting an apple into the dishwasher will be assigned a very low score\. This constraint is key to generating plausible hypotheses and prevents reward hacking of the action likelihood: for instance, a goal that includes every possible item yields a high action\-likelihood score but a low prior score\. Compared to the full model \(Row I\), the speedup drops by 2\.1 percentage points without explicit prior modeling \(Row II\)\.
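The interaction between the two reward terms can be illustrated with a toy calculation\. Equation \(4\) is not reproduced in this section, so the sketch below simply assumes the score of a hypothesis is its log prior plus the summed log\-likelihoods of the observed actions\. The probabilities, goal strings, and function names are hypothetical and chosen only to show the reward\-hacking failure mode\.

```python
import math


def hypothesis_score(log_prior, action_log_likelihoods, use_prior=True):
    """Score a goal hypothesis as log prior plus summed action log-likelihoods.

    With use_prior=False only the likelihood term is kept, mimicking the
    ablation without explicit prior modeling (assumed form, not Equation 4).
    """
    likelihood_term = sum(action_log_likelihoods)
    return likelihood_term + (log_prior if use_prior else 0.0)


# Hypothetical numbers for two goal hypotheses after two observed actions.
# The "everything" goal explains any action well but is implausible a priori.
hypotheses = {
    "set the dinner table": (math.log(0.5), [math.log(0.6), math.log(0.6)]),
    "move every item everywhere": (math.log(0.001), [math.log(0.9), math.log(0.9)]),
}


def best_hypothesis(use_prior):
    """Return the highest-scoring goal under the chosen scoring rule."""
    return max(
        hypotheses,
        key=lambda goal: hypothesis_score(*hypotheses[goal], use_prior=use_prior),
    )


if __name__ == "__main__":
    print("likelihood only:", best_hypothesis(use_prior=False))
    print("prior + likelihood:", best_hypothesis(use_prior=True))
```

In this toy setting the likelihood\-only rule prefers the all\-inclusive goal, while adding the prior term restores the focused one, which is the behavior the prior check is meant to enforce\.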

#### Multiple Hypotheses

Maintaining a set of mental state hypotheses is important for capturing uncertainty when interpreting human behavior\. For example, in the early stage of an episode, the assistant has observed only a small amount of human behavior, so each hypothesis remains ambiguous and carries low confidence\. Relying on a single estimate would lead to premature commitment to a potentially incorrect goal\. By tracking a beam of hypotheses, the system can defer the decision until sufficient evidence has accumulated\. Compared to the full model \(Row I\), the speedup drops by 8\.8 percentage points when the model generates only the single most likely mental state \(Row III\)\. Accordingly, the token usage of this variant is the lowest\.
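The value of deferring commitment can be sketched with an explicit Bayesian update over a small set of goal hypotheses\. This is not MindZero's inference procedure, which produces hypotheses in a single forward pass, and the helping planner used in the experiments is uncertainty\-aware rather than threshold\-based\. The sketch only illustrates the model\-based idea; the threshold, the goal labels, and the likelihood values are hypothetical\.

```python
import math


def update_belief(log_belief, log_likelihood):
    """One Bayesian step: add the newest action's log-likelihood, renormalize."""
    unnorm = {g: log_belief[g] + log_likelihood[g] for g in log_belief}
    m = max(unnorm.values())
    # Log-sum-exp keeps the normalization numerically stable.
    log_z = m + math.log(sum(math.exp(v - m) for v in unnorm.values()))
    return {g: v - log_z for g, v in unnorm.items()}


def committed_goal(log_belief, threshold=0.8):
    """Return the most likely goal only if its probability reaches threshold."""
    goal, log_p = max(log_belief.items(), key=lambda kv: kv[1])
    return goal if math.exp(log_p) >= threshold else None


if __name__ == "__main__":
    goals = ["red+blue", "red+green", "blue+green"]
    belief = {g: math.log(1.0 / len(goals)) for g in goals}  # uniform prior

    # Step 1: an ambiguous action that fits two goals equally well.
    belief = update_belief(
        belief,
        {"red+blue": math.log(0.4), "red+green": math.log(0.4), "blue+green": math.log(0.2)},
    )
    print(committed_goal(belief))  # no commitment yet

    # Step 2: a disambiguating action.
    belief = update_belief(
        belief,
        {"red+blue": math.log(0.9), "red+green": math.log(0.1), "blue+green": math.log(0.1)},
    )
    print(committed_goal(belief))
```

A single\-hypothesis tracker would have to pick between the two tied goals after the first step, which is the premature commitment the ablation in Row III penalizes\.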

#### Entropy Bonus

The hypothesis distribution often suffers from mode collapse, where the model becomes overconfident in a single prediction too early\. To mitigate this, the entropy regularization term in Equation \([3](https://arxiv.org/html/2606.00240#S4.E3)\) encourages diversity among the hypotheses\. This bonus penalizes overly peaked distributions and ensures the model retains alternative possibilities during reasoning\. Compared to the full model \(Row I\), the speedup drops by 13\.9 percentage points without the entropy bonus \(Row IV\)\.
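To make the effect of such a bonus concrete, the sketch below compares the Shannon entropy of a peaked and a spread\-out hypothesis distribution and adds a weighted entropy term to a scalar reward\. Equation \(3\) is not shown in this section, so the additive form, the weight `beta`, and the example distributions are assumptions for illustration only\.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_objective(reward, probs, beta=0.1):
    """Reward plus an entropy bonus weighted by beta (assumed additive form)."""
    return reward + beta * entropy(probs)


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]  # near mode collapse
    spread = [0.4, 0.3, 0.2, 0.1]      # keeps alternatives alive

    print(entropy(peaked), entropy(spread))
    # With equal task reward, the spread distribution scores higher.
    print(regularized_objective(1.0, peaked), regularized_objective(1.0, spread))
```

Because entropy is maximized by the uniform distribution, the weight on the bonus trades off diversity against confidence; the value used here is arbitrary\.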

### 6\.4 Human Experiment

To evaluate whether MindZero can support real users, we conducted a human experiment in the Household Proactive Assistance domain\. Participants acted as the main agent and completed four household tasks from our test set\. We recruited 12 participants from Johns Hopkins University\. The study was approved by the JHU institutional review board\.

Experimental Setup\. We compare four settings: a Single Human without assistance, and assistance with Qwen3\-4B, with MindZero trained from Qwen3\-4B, and with Gemini\-3\-Flash\. The Single Human setting serves as the reference for computing speedup\. All assisted settings use the same helper\-agent pipeline as in the Household Proactive Assistance experiments, varying only the mental inference model\.

Results\. The pretrained Qwen3\-4B model yields only a marginal speedup of 2\.6%\. In contrast, MindZero trained from Qwen3\-4B achieves a speedup of 19\.7% \(standard error 6\.3%\), a substantial improvement over the same Qwen3\-4B backbone\. Gemini\-3\-Flash achieves a speedup of 23\.4% \(standard error 6\.4%\)\. Although Gemini\-3\-Flash attains a slightly higher mean speedup, the difference between Gemini\-3\-Flash and MindZero is not statistically significant under a paired t\-test on speedup \(p=0\.24\), consistent with the results in Section [6](https://arxiv.org/html/2606.00240#S6)\.

These results show that MindZero transfers to real human behavior and provides effective assistance\. MindZero reaches performance comparable to Gemini\-3\-Flash while using a small open\-weight model, making it easier to deploy locally and more cost\-effective for large\-scale assistance\.
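For readers who want to reproduce this kind of comparison on their own data, the sketch below computes a standard error and a paired t statistic using only the Python standard library\. The speedup values are made up for illustration and are not the study's measurements\. Turning the statistic into a p\-value needs the Student t distribution, for example via `scipy.stats.ttest_rel`, which is not used here to keep the block dependency\-free\.

```python
import math
from statistics import mean, stdev


def standard_error(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return stdev(values) / math.sqrt(len(values))


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / standard_error(diffs), len(diffs) - 1


if __name__ == "__main__":
    # Hypothetical per-participant speedups (percent) under two helpers.
    helper_a = [25.0, 10.0, 31.0, 18.0, 22.0, 5.0]
    helper_b = [20.0, 14.0, 27.0, 15.0, 24.0, 2.0]

    t, dof = paired_t_statistic(helper_a, helper_b)
    print(f"t = {t:.3f}, dof = {dof}")
    print(f"SE(helper_a) = {standard_error(helper_a):.2f}")
```

A paired test is the natural choice when the same participants complete tasks under each helper, since it removes between\-participant variability from the comparison\.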

## 7 Conclusion

We introduced MindZero, a self\-supervised reinforcement learning framework for training multimodal language models to perform robust and efficient online Theory of Mind reasoning without relying on mental state annotations\. By rewarding hypotheses that best explain observed behavior, MindZero enables models to internalize the deliberative structure of model\-based ToM while retaining the speed of single\-pass inference\. Extensive evaluations across question answering and proactive assistance tasks demonstrate that MindZero achieves strong robustness and uncertainty tracking comparable to explicit model\-based methods, while substantially reducing computational cost\. These results show that mental reasoning can be learned as a self\-supervised skill grounded in behavioral evidence, bridging the long\-standing gap between interpretability, robustness, and efficiency in ToM modeling\. We believe MindZero provides a promising foundation for scalable, real\-world assistive agents that can continuously reason about human intentions and adapt to dynamic environments\.

Limitations and Future Work\. Our current MindZero framework does not model recursive reasoning between multiple agents\. Additionally, as the observed input sequence grows longer, the number of input tokens the model must process grows accordingly\. In the future, we intend to extend MindZero to incorporate multi\-agent recursive mental reasoning into the training process\. We also plan to develop a more efficient model structure to address the challenge of long input sequences\.

## Impact Statement

This paper presents work aimed at advancing the field of machine learning by developing more robust and efficient methods for online Theory of Mind reasoning in assistive AI systems\. By enabling models to infer human intentions and uncertainty from behavior without relying on explicit annotations, our approach has the potential to enhance the reliability, responsiveness, and scalability of AI agents in real\-world applications such as household assistance, digital services, and human–computer interaction\. These advances may contribute to more helpful, adaptive, and accessible technologies that better align with users’ needs and preferences, thereby improving user experience and productivity\.

At the same time, enhanced mental reasoning capabilities may raise ethical considerations\. Systems that more accurately model human intentions and beliefs may be misused for manipulation, surveillance, or unwanted behavioral profiling if deployed without appropriate safeguards\. Moreover, errors in inferred mental states could result in inappropriate assistance, reduced user autonomy, or the reinforcement of existing biases present in behavioral data\. We emphasize that responsible use requires transparency, user consent, and careful evaluation in real\-world settings\. We hope this research encourages further discussion on the ethical development and deployment of human\-centered AI systems and supports future work on fairness, accountability, and privacy\-preserving mental reasoning models\.

## Author Contributions

Shunchi Zhang conceived the idea and developed it into the present work; he carried out the main environment setup, data processing, model training, and evaluation, including the extensive exploratory experiments and the core experimental results reported in the paper\. Jin Lu conducted a large number of additional experiments, primarily baselines and supplementary studies; he also independently performed the human study, contributed to the early\-stage exploration of the GridWorld experiments, and carried out exploratory work on web assistance that informed the final design\. Chuanyang Jin contributed to paper writing and figure design\. Yichao Zhou implemented the GridWorld environment setup, data processing, model training, and evaluation, with Shunchi Zhang’s assistance\. Zhining Zhang contributed the AutoToM\-related experiments\. Tianmin Shu provided overall research direction and weekly guidance and contributed to the paper revision\. All authors contributed to the paper writing\.

## Acknowledgement

This work is supported by a grant from Amazon\. Chuanyang Jin is supported by the Amazon AI PhD Fellowship\.

## References

- C\. L\. Baker, J\. Jara\-Ettinger, R\. Saxe, and J\. B\. Tenenbaum \(2017\) Rational quantitative attribution of beliefs, desires and percepts in human mentalizing\. Nature Human Behaviour 1\(4\), pp\. 0064\.
- C\. L\. Baker, R\. Saxe, and J\. B\. Tenenbaum \(2009\) Action understanding as inverse planning\. Cognition 113\(3\), pp\. 329–349\.
- C\. Bara, C\. Sky, and J\. Chai \(2021\) MindCraft: theory of mind modeling for situated dialogue in collaborative tasks\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 1112–1125\.
- Y\. Bengio, R\. Ducharme, P\. Vincent, and C\. Jauvin \(2003\) A neural probabilistic language model\. Journal of Machine Learning Research \(JMLR\) 3\(Feb\), pp\. 1137–1155\.
- C\. M\. Bishop \(2006\) Pattern recognition and machine learning\. Springer\.
- M\. Bortoletto, C\. Ruhdorfer, and A\. Bulling \(2025a\) ToM\-ssi: evaluating theory of mind in situated social interactions\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 32252–32277\.
- M\. Bortoletto, C\. Ruhdorfer, L\. Shi, and A\. Bulling \(2024a\) Explicit modelling of theory of mind for belief prediction in nonverbal social interactions\. arXiv preprint arXiv:2407\.06762\.
- M\. Bortoletto, L\. Shi, and A\. Bulling \(2024b\) Neural reasoning about agents’ goals, preferences, and actions\. In Proceedings of the AAAI Conference on Artificial Intelligence \(AAAI\), Vol\. 38, pp\. 456–464\.
- M\. Bortoletto, Y\. Zhou, L\. Ying, T\. Shu, and A\. Bulling \(2025b\) ProToM: promoting prosocial behaviour via theory of mind\-informed feedback\. arXiv preprint arXiv:2509\.05091\.
- J\. Y\. Chai, L\. She, R\. Fang, S\. Ottarson, C\. Littley, C\. Liu, and K\. Hanson \(2014\) Collaborative effort towards common ground in situated human\-robot dialogue\. In Proceedings of the 2014 ACM/IEEE international conference on Human\-robot interaction, pp\. 33–40\.
- L\. Chen, K\. Lu, A\. Rajeswaran, K\. Lee, A\. Grover, M\. Laskin, P\. Abbeel, A\. Srinivas, and I\. Mordatch \(2021\) Decision transformer: reinforcement learning via sequence modeling\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 34, pp\. 15084–15097\.
- L\. Cross, V\. Xiang, A\. Bhatia, D\. L\. Yamins, and N\. Haber \(2024\) Hypothetical minds: scaffolding theory of mind for multi\-agent tasks with large language models\. arXiv preprint arXiv:2407\.07086\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, et al\. \(2024\) The llama 3 herd of models\. arXiv e\-prints, pp\. arXiv–2407\.
- X\. Fan, X\. Zhou, C\. Jin, K\. Nottingham, H\. Zhu, and M\. Sap \(2025\) SoMi\-tom: evaluating multi\-perspective theory of mind in embodied social interactions\. In Advances in Neural Information Processing Systems Datasets and Benchmarks \(NeurIPS D&B\)\.
- K\. Gandhi, J\. Fränken, T\. Gerstenberg, and N\. Goodman \(2023\) Understanding social reasoning in language models with language models\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 36, pp\. 13518–13529\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, et al\. \(2025\) DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\. Nature 645\(8081\), pp\. 633–638\.
- G\. Hou, W\. Zhang, Y\. Shen, L\. Wu, and W\. Lu \(2024\) TimeToM: temporal space is the key to unlocking the door of large language models’ theory\-of\-mind\. In Findings of the Association for Computational Linguistics: ACL, pp\. 11532–11547\.
- X\. A\. Huang, E\. La Malfa, S\. Marro, A\. Asperti, A\. G\. Cohn, and M\. J\. Wooldridge \(2024\) A notion of complexity for theory of mind via discrete world models\. In Findings of the Association for Computational Linguistics: EMNLP, pp\. 2964–2983\.
- P\. Jayannavar, A\. Narayan\-Chen, and J\. Hockenmaier \(2020\) Learning to execute instructions in a minecraft dialogue\. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp\. 2589–2602\.
- K\. Jha, T\. A\. Le, C\. Jin, Y\. Kuo, J\. B\. Tenenbaum, and T\. Shu \(2024\) Neural amortized inference for nested multi\-agent reasoning\. In Proceedings of the AAAI Conference on Artificial Intelligence \(AAAI\), Vol\. 38, pp\. 530–537\.
- C\. Jin, B\. Li, H\. Xie, C\. M\. Fang, T\. Li, S\. Longpre, H\. Gu, M\. Chen, and T\. Shu \(2026\) ThoughtTrace: understanding user thoughts in real\-world llm interactions\. arXiv preprint arXiv:2605\.20087\.
- C\. Jin, Y\. Wu, J\. Cao, J\. Xiang, Y\. Kuo, Z\. Hu, T\. Ullman, A\. Torralba, J\. Tenenbaum, and T\. Shu \(2024\) Mmtom\-qa: multimodal theory of mind question answering\. In Proceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\), pp\. 16077–16102\.
- C\. Jin, J\. Xu, B\. Liu, L\. Tao, O\. Golovneva, T\. Shu, W\. Zhao, X\. Li, and J\. Weston \(2025\) The era of real\-world human interaction: rl from user conversations\. arXiv preprint arXiv:2509\.25137\.
- C\. Jung, D\. Kim, J\. Jin, J\. Kim, Y\. Seonwoo, Y\. Choi, A\. Oh, and H\. Kim \(2024\) Perceptions to beliefs: exploring precursory inferences for theory of mind in large language models\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 19794–19809\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\) Scaling laws for neural language models\. arXiv preprint arXiv:2001\.08361\.
- H\. Kim, M\. Sclar, T\. Zhi\-Xuan, L\. Ying, S\. Levine, Y\. Liu, J\. B\. Tenenbaum, and Y\. Choi \(2025\) Hypothesis\-driven theory\-of\-mind reasoning for large language models\. In Proceedings of the Conference on Language Modeling \(COLM\)\.
- H\. Kim, M\. Sclar, X\. Zhou, R\. Bras, G\. Kim, Y\. Choi, and M\. Sap \(2023\) FANToM: a benchmark for stress\-testing machine theory of mind in interactions\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 14397–14413\.
- D\. P\. Kingma and M\. Welling \(2014\) Auto\-encoding variational bayes\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\.
- M\. Le, Y\. Boureau, and M\. Nickel \(2019\) Revisiting the evaluation of theory of mind through question answering\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 5872–5877\.
- Y\. Liang, D\. Chen, A\. Gupta, S\. S\. Du, and N\. Jaques \(2024\) Learning to cooperate with humans using generative agents\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 37, pp\. 60061–60087\.
- C\. Liu, R\. Fang, and J\. Chai \(2012\) Towards mediating shared perceptual basis in situated dialogue\. In Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp\. 140–149\.
- Y\. Lu, C\. Zhang, J\. Song, L\. Fan, and W\. Wang \(2025\) Do theory of mind benchmarks need explicit human\-like reasoning in language models?\. arXiv preprint arXiv:2504\.01698\.
- C\. Ma, K\. Lu, R\. Desai, X\. Puig, A\. Markham, and N\. Trigoni \(2025\) Coopera: continual open\-ended human\-robot assistance\. arXiv preprint arXiv:2510\.23495\.
- A\. Narayan\-Chen, P\. Jayannavar, and J\. Hockenmaier \(2019\) Collaborative dialogue in minecraft\. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp\. 5405–5415\.
- X\. Puig, T\. Shu, S\. Li, Z\. Wang, Y\. Liao, J\. B\. Tenenbaum, S\. Fidler, and A\. Torralba \(2021\) Watch\-and\-help: a challenge for social perception and human\-ai collaboration\. In Proceedings of the International Conference on Learning Representations \(ICLR\)\.
- X\. Puig, T\. Shu, J\. B\. Tenenbaum, and A\. Torralba \(2023\) NOPA: neurally\-guided online probabilistic assistance for building socially intelligent home assistants\. In Proceedings of the IEEE International Conference on Robotics and Automation \(ICRA\), pp\. 7628–7634\.
- N\. Rabinowitz, F\. Perbet, F\. Song, C\. Zhang, S\. A\. Eslami, and M\. Botvinick \(2018\) Machine theory of mind\. In Proceedings of the International Conference on Machine Learning \(ICML\), pp\. 4218–4227\.
- A\. Radford, K\. Narasimhan, T\. Salimans, I\. Sutskever, et al\. \(2018\) Improving language understanding by generative pre\-training\.
- N\. Rhinehart, R\. McAllister, K\. Kitani, and S\. Levine \(2019\) Precog: prediction conditioned on goals in visual multi\-agent settings\. In Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\), pp\. 2821–2830\.
- M\. Sclar, S\. Kumar, P\. West, A\. Suhr, Y\. Choi, and Y\. Tsvetkov \(2023\) Minding language models’ \(lack of\) theory of mind: a plug\-and\-play multi\-character belief tracker\. In Proceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\), pp\. 13960–13980\.
- M\. Sclar, J\. Yu, M\. Fazel\-Zarandi, Y\. Tsvetkov, Y\. Bisk, Y\. Choi, and A\. Celikyilmaz \(2024\) Explore theory of mind: program\-guided adversarial data generation for theory of mind reasoning\. arXiv preprint arXiv:2412\.12175\.
- S\. Seo, B\. Han, and V\. Unhelkar \(2023\) Automated task\-time interventions to improve teamwork using imitation learning\. arXiv preprint arXiv:2303\.00413\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- H\. Shi, S\. Ye, X\. Fang, C\. Jin, L\. Isik, Y\. Kuo, and T\. Shu \(2025\) Muma\-tom: multi\-modal multi\-agent theory of mind\. In Proceedings of the AAAI Conference on Artificial Intelligence \(AAAI\), Vol\. 39, pp\. 1510–1519\.
- M\. Shvo, R\. Hari, Z\. O’Reilly, S\. Abolore, S\. N\. Wang, and S\. A\. McIlraith \(2022\) Proactive robotic assistance via theory of mind\. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\), pp\. 9148–9155\.
- A\. Suhr, C\. Yan, J\. Schluger, S\. Yu, H\. Khader, M\. Mouallem, I\. Zhang, and Y\. Artzi \(2019\) Executing instructions in situated collaborative interactions\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 2119–2130\.
- T\. Ullman, C\. Baker, O\. Macindoe, O\. Evans, N\. Goodman, and J\. Tenenbaum \(2009\) Help or hinder: bayesian models of social goal inference\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 22\.
- T\. Ullman \(2023\)Large language models fail on trivial alterations to theory\-of\-mind tasks\.arXiv preprint arXiv:2302\.08399\.Cited by:[§1](https://arxiv.org/html/2606.00240#S1.p1.1)\.
- Q\. Wang, K\. Saha, E\. Gregori, D\. Joyner, and A\. Goel \(2021\)Towards mutual theory of mind in human\-ai interaction: how language reflects what students perceive about a virtual teaching assistant\.InProceedings of the CHI Conference on Human Factors in Computing Systems,pp\. 1–14\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Wilf, S\. Lee, P\. P\. Liang, and L\. Morency \(2024\)Think twice: perspective\-taking improves large language models’ theory\-of\-mind capabilities\.InProceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 8292–8308\.Cited by:[§1](https://arxiv.org/html/2606.00240#S1.p1.1)\.
- H\. Wimmer and J\. Perner \(1983\)Beliefs about beliefs: representation and constraining function of wrong beliefs in young children’s understanding of deception\.Cognition13\(1\),pp\. 103–128\.Cited by:[§1](https://arxiv.org/html/2606.00240#S1.p1.1)\.
- Y\. Wu, Y\. He, Y\. Jia, R\. Mihalcea, Y\. Chen, and N\. Deng \(2023\)Hi\-tom: a benchmark for evaluating higher\-order theory of mind reasoning in large language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 10691–10706\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Xu, R\. Zhao, L\. Zhu, J\. Du, and Y\. He \(2024\)OpenToM: a comprehensive benchmark for evaluating theory\-of\-mind reasoning capabilities of large language models\.InProceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 8593–8623\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[1st item](https://arxiv.org/html/2606.00240#S5.I1.i1.p1.1)\.
- L\. Ying, K\. M\. Collins, M\. Wei, C\. E\. Zhang, T\. Zhi\-Xuan, A\. Weller, J\. B\. Tenenbaum, and L\. Wong \(2023\)The neuro\-symbolic inverse planning engine \(NIPE\): modeling probabilistic social inferences from linguistic inputs\.InFirst Workshop on Theory of Mind in Communicating Agents,Cited by:[§1](https://arxiv.org/html/2606.00240#S1.p4.1)\.
- L\. Ying, K\. Jha, S\. Aarya, J\. B\. Tenenbaum, A\. Torralba, and T\. Shu \(2024\)GOMA: proactive embodied cooperative communication via goal\-oriented mental alignment\.In2024 IEEE/RSJ International Conference on Intelligent Robots and Systems \(IROS\),pp\. 7099–7106\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Yu, Q\. Wang, S\. Zhang, Y\. Sang, K\. Pu, Z\. Wei, H\. Wang, L\. Xu, J\. Li, Y\. Yu,et al\.\(2024\)Few\-shot character understanding in movies as an assessment to meta\-learning of theory\-of\-mind\.InProceedings of the International Conference on Machine Learning \(ICML\),pp\. 57703–57729\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhang, P\. Robertson, T\. Shu, S\. Hong, and B\. C\. Williams \(2024\)Risk\-bounded online team interventions via theory of mind\.In2024 IEEE International Conference on Robotics and Automation \(ICRA\),pp\. 12964–12970\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhang, C\. Jin, M\. Y\. Jia, S\. Zhang, and T\. Shu \(2025\)Autotom: scaling model\-based mental inference via automated agent modeling\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§A\.3](https://arxiv.org/html/2606.00240#A1.SS3.p2.1),[§C\.1](https://arxiv.org/html/2606.00240#A3.SS1.p1.1),[4\(a\)](https://arxiv.org/html/2606.00240#A6.T4.st1.2.19.1.1),[4\(b\)](https://arxiv.org/html/2606.00240#A6.T4.st2.2.20.1.1),[§1](https://arxiv.org/html/2606.00240#S1.p4.1),[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px1.p1.1),[3rd item](https://arxiv.org/html/2606.00240#S5.I1.i3.p1.1)\.
- T\. Zhi\-Xuan, J\. Mann, T\. Silver, J\. Tenenbaum, and V\. Mansinghka \(2020\)Online bayesian goal inference for boundedly rational planning agents\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,pp\. 19238–19250\.Cited by:[§3\.1](https://arxiv.org/html/2606.00240#S3.SS1.p1.6),[§3\.1](https://arxiv.org/html/2606.00240#S3.SS1.p3.1)\.
- T\. Zhi\-Xuan, L\. Ying, V\. Mansinghka, and J\. B\. Tenenbaum \(2024\)Pragmatic instruction following and goal assistance via cooperative language\-guided inverse planning\.arXiv preprint arXiv:2402\.17930\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Zhou, Q\. Wang, M\. Yu, X\. Yue, R\. Lu, J\. Li, Y\. Zhou, S\. Zhang, J\. Zhou, and W\. Lam \(2025a\)The essence of contextual understanding in theory of mind: a study on question answering with story characters\.InProceedings of the Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 22612–22631\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Zhou, V\. Chen, Z\. Z\. Wang, G\. Neubig, M\. Sap, and X\. Wang \(2025b\)Tom\-swe: user mental modeling for software engineering agents\.arXiv preprint arXiv:2510\.21903\.Cited by:[§2](https://arxiv.org/html/2606.00240#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix AMindZeroImplementation Details

### A\.1Model Training

AllMindZeromodels are trained with standard GRPO in theVeRLframework using 4×\\timesH100 GPUs\. For Household domain, we additionally serve a reward model Qwen3\-235B\-A22B\-FP8 model withvLLMusing 4×\\timesH100 GPUs\. We use 32 rollout samples per prompt as the hypothesis proposal set, a rollout batch size of 32, a global batch size of 8, and train for 20 epochs with AdamW in bf16\. The main optimization hyperparameters are a learning rate of1×10−61\\times 10^\{\-6\}, weight decay of1×10−21\\times 10^\{\-2\}, a max grad norm of 1\.0, and a KL coefficient of1×10−21\\times 10^\{\-2\}\. Detailed configurations are open\-sourced at[https://github\.com/SCAI\-JHU/MindZero/tree/main/configs](https://github.com/SCAI-JHU/MindZero/tree/main/configs)\.

### A\.2Evaluation Metrics

#### Speedup in Proactive Assistance\.

We measure collaborative efficiency using thespeedupmetric:

speedup=ThumanTcollab−1\\text\{speedup\}=\\frac\{T\_\{\\text\{human\}\}\}\{T\_\{\\text\{collab\}\}\}\-1\(5\)whereThumanT\_\{\\text\{human\}\}denotes the time required when the helper remains stationary, andTcollabT\_\{\\text\{collab\}\}denotes the time taken with active assistance\.

#### Inference Cost\.

We report the inference cost in terms of floating point operations \(FLOPs\) using the approximation

FLOPs=2×Pactive×NtokensFLOPsTrillion=2×PactiveBillion×Ntokens×11000,\\begin\{split\}\\text\{FLOPs\}&=2\\times P\_\{\\text\{active\}\}\\times N\_\{\\text\{tokens\}\}\\\\ \\frac\{\\text\{FLOPs\}\}\{\\text\{Trillion\}\}&=2\\times\\frac\{P\_\{\\text\{active\}\}\}\{\\text\{Billion\}\}\\times N\_\{\\text\{tokens\}\}\\times\\frac\{1\}\{1000\},\\end\{split\}\(6\)wherePactiveP\_\{\\text\{active\}\}denotes the active parameter count andNtokensN\_\{\\text\{tokens\}\}represents the total number of processed tokens\(Kaplanet al\.,[2020](https://arxiv.org/html/2606.00240#bib.bib23)\)\.

### A\.3Prompt Examples

We use the same instruction but different context inputs for every task\. Examples are shown in Figure[6](https://arxiv.org/html/2606.00240#A1.F6)\-[9](https://arxiv.org/html/2606.00240#A1.F9)for task context\.

For reward evaluation in Household domain, we adopt similar prompts in AutoToM\(Zhanget al\.,[2025](https://arxiv.org/html/2606.00240#bib.bib35)\)\.

![Refer to caption](https://arxiv.org/html/2606.00240v1/figures/hf_gw_tom_1.jpg)

<image\>YouareahelperagentinaGridWorldenvironment\.Youaretheredrobot,andtheHumanisthegreenrobot\.Therearemultipleobjects:brownsquare,pinksquare,redsquare,bluesquare,greensquare,yellowsquare,orangesquare,andpurplesquare\.TheHuman’sgoalistoplacetwooftheobjectsnexttoeachother\.TheHumancanmoveup,down,left,right,orstay,andcanpickupanobjectwhenstandingonitandnotholdingone,andcanputdownanobjectwhenholdingoneandthecellisempty\.TheHuman’sactiontrajectorysofarisshownintheimage\.GiventhattheHumanintendstoplaceanobjectnexttotheyellowsquare,whichobjectistheHumanmorelikelytopickupnext?\(a\)orangesquare\.\(b\)greensquare\.

Pleaserespondwithonlyasinglelowercaseletteraorb\.

Figure 6:A prompt example for GridWorld Question Answering\.![Refer to caption](https://arxiv.org/html/2606.00240v1/figures/hf_gw_asst_1.jpg)

<image\>YouareahelperagentinaGridWorldenvironment\.Youaretheredrobot,andtheHumanisthegreenrobot\.Therearemultipleobjects:pinksquare,bluesquare,purplesquare,yellowsquare,brownsquare,greensquare,orangesquare,andredsquare\.TheHuman’sgoalistoplacetwooftheobjectsnexttoeachother\.TheHumancanmoveup,down,left,right,orstay,andcanpickupanobjectwhenstandingonitandnotholdingone,andcanputdownanobjectwhenholdingoneandthecellisempty\.TheHuman’sactiontrajectorysofarisshownintheimage\.Pleaseproposeaprobabilitydistributionthatincludes2candidatepairedgoalsandtheirprobabilities\.YourresponseshouldincludetheprobabilitydistributionformattedaccordingtothisJSONschema:\{"$defs":\{"GoalParticle":\{"properties":\{"object1":\{"$ref":"\#/$defs/Object"\},"object2":\{"$ref":"\#/$defs/Object"\},"p":\{"description":"Probabilityofthegoalproposal","maximum":1,"minimum":0,"title":"P","type":"number"\}\},"required":\["object1","object2","p"\],"title":"GoalParticle","type":"object"\},"Object":\{"properties":\{"color":\{"title":"Color","type":"string"\},"shape":\{"title":"Shape","type":"string"\}\},"required":\["color","shape"\],"title":"Object","type":"object"\}\},"properties":\{"particles":\{"items":\{"$ref":"\#/$defs/GoalParticle"\},"title":"Particles","type":"array"\}\},"required":\["particles"\],"title":"GoalParticles","type":"object"\}\.

NotethattheHuman\(greenrobot\)consistentlyprioritizespickinguptheobjectclosesttoitsinitialstartingpositionfirst,subsequentlyplacingitnexttotheobjectthatwasinitiallyfurtheraway\.InyourJSONresponse,ensurethatforeveryGoalParticle,object1isstrictlytheobjectclosertotheHuman\(greenrobot\)’sstartingposition,andobject2istheobjectfurtherfromit\.

PleaseoutputtheminifiedJSON\.

Figure 7:A prompt example for GridWorld Proactive Assistance\.What’sinsidetheapartment:Thereisakitchenandabathroomandabedroomandalivingroom\.

fourkitchencabinetsandastoveandarefrigeratorandamicrowaveandakitchentableareinthekitchen\.acondimentbottleisonthefourthkitchencabinet\.adishbowlandtwowineglassesandanappleareonthefirstkitchencabinet\.adishbowlandabottleofwineandacondimentbottleandawineglassareonthethirdkitchencabinet\.Thereisnothinginsidethestove\.aplateandacupcakeandabottleofwineandadishbowlareinsidetherefrigerator\.asalmonisinsidethemicrowave\.

abathroomcabinetisinthebathroom\.Thereisnothinginsidethebathroomcabinet\.

acoffeetableandadeskareinthebedroom\.

acoffeetableandacabinetandadeskandasofaareinthelivingroom\.awaterglassandabookareonthecoffeetable\.twocupcakesandtwodishbowlsandaremotecontrolandawineglassareinsidethecabinet\.

ActionstakenbyMary:Maryisinsidethebedroom\.Marywalkstowardsthekitchen\.

Question:IfMaryhasbeentryingtogetadishbowl,whichoneofthefollowingstatementsismorelikelytobetrue?\(a\)Marythinksthatthedishbowlisinsidethekitchen\.\(b\)Marythinksthatthedishbowlisnotinsidethekitchen\.Pleaserespondwitheitheraorb\.

YouFIRSTthinkaboutthereasoningprocessasaninternalmonologueandthenprovidethefinalanswer\.ThereasoningprocessMUSTBEenclosedwithin<thinking\></thinking\>tags\.ThefinalanswerMUSTBEputin\\boxed\{\}\.

Figure 8:A prompt example for Household Question Answering\.Humanhasbeenworkingonataskofmovingsomeobjectstoatargetlocation\.Thetasktypecanonlybeoneofthefollowing:settingupatable,puttingsomethinginthedishwasher,puttingsomethinginthefridge,preparingfood,orwatchingTV\.

Yourareahelpfulassistant\.Inordertohelphuman,pleaseproposemultiplehypothesesof\[human’soverallgoal\]\(includingbothfinishedandpotentialfuturesubgoals\),baseonthefollowinginformation:

\[currentstate\]

Theapartmenthas4rooms:bathroom,bedroom,kitchen,livingroom\.

Thebathroomhas1bathroomcabinet\.

Thebedroomhas1coffeetable\.

\-Thecoffeetablesupports1wineglass,1plate\.

Thekitchenhas1fridge,4kitchencabinet,1kitchentable,1microwave,1stove\.

\-Thefridgecontains1plate,2cupcake,1salmon,1pudding\.

\-Thekitchencabinetcontains1apple,3cutleryfork\.

\-Thekitchencabinetcontains1wineglass,1cutleryfork\.

\-Thekitchencabinetcontains1wineglass,1cutleryfork\.

\-Thekitchencabinetcontains2condimentbottle\.

\-Themicrowavecontains1condimentbottle,1salmon\.

\-Thestovecontains1salmon,1cupcake\.

Thelivingroomhas1cabinet,1coffeetable\.

\-Thecabinetcontains1remotecontrol,1cupcake,1wineglass\.

\-Thecoffeetablesupports1plate,1remotecontrol\.

Humanisinthekitchen\.

Humaniscloseto4wallpictureframe,1salmon,2condimentbottle,1microwave,1wallphone,6bellpepper,3kitchencounterdrawer,1dishbowl,1clock,1lightswitch,1pudding,1cutleryknife,1plate,1fridge,1powersocket,1book,1bench,1sink,1kitchencounter,1kitchencabinet,1rug\.

Humanisholdingnothing\.

\[keyactionhistory\]

Humanhasnottakenanykeyactionyet\.

\[human’snextaction\]

Humanwalkstowardsthekitchencabinet

Hints:

\-Thetasktypeisconstantandthetargetlocationisunique,i\.e\.,humanwillbeconsistentlydoingthesametask\(settingupatable,puttingsomethinginthedishwasher,puttingsomethinginthefridge,preparingfood,orwatchingTV\)andputallobjectstothesamelocation\.

\-Pleaseproposediversegoalsinbothobjecttypeandcount\.

OutputRequirements:

Pleaseprovideaprobabilitydistributionovern=10hypothesesof\[human’soverallgoal\]\(includingbothfinishedandpotentialfuturesubgoals\)\.

YourresponseshouldincludetheprobabilitydistributionformattedaccordingtothisJSONschema:\{’$defs’:\{’GoalParticle’:\{’properties’:\{’task\_name’:\{’enum’:\[’prepare\_food’,’put\_dishwasher’,’put\_fridge’,’setup\_table’,’watch\_tv’\],’title’:’TaskName’,’type’:’string’\},’objects’:\{’items’:\{’$ref’:’\#/$defs/Object’\},’minItems’:1,’title’:’Objects’,’type’:’array’\},’target’:\{’$ref’:’\#/$defs/Target’\},’p’:\{’description’:’Probabilityofthegoalproposal’,’maximum’:1,’minimum’:0,’title’:’P’,’type’:’number’\}\},’required’:\[’task\_name’,’objects’,’target’,’p’\],’title’:’GoalParticle’,’type’:’object’\},’Object’:\{’properties’:\{’type’:\{’enum’:\[’apple’,’chips’,’condimentbottle’,’cupcake’,’cutleryfork’,’plate’,’pudding’,’remotecontrol’,’salmon’,’waterglass’,’wineglass’\],’title’:’Type’,’type’:’string’\},’count’:\{’minimum’:1,’title’:’Count’,’type’:’integer’\}\},’required’:\[’type’,’count’\],’title’:’Object’,’type’:’object’\},’Target’:\{’properties’:\{’type’:\{’enum’:\[’coffeetable’,’dishwasher’,’fridge’,’kitchentable’,’stove’\],’title’:’Type’,’type’:’string’\}\},’required’:\[’type’\],’title’:’Target’,’type’:’object’\}\},’properties’:\{’particles’:\{’items’:\{’$ref’:’\#/$defs/GoalParticle’\},’title’:’Particles’,’type’:’array’\}\},’required’:\[’particles’\],’title’:’GoalParticles’,’type’:’object’\}

PleaseoutputtheminifiedJSON\.

Figure 9:A prompt example for Household Proactive Assistance\.

## Appendix BGridWorld Experiments

We provide the experimental details of our GridWorld Question Answering \(Section[5\.1](https://arxiv.org/html/2606.00240#S5.SS1)\) and Proactive Assistance \(Section[5\.2](https://arxiv.org/html/2606.00240#S5.SS2)\) experiments\.

### B\.1Environment Setup

We randomly generate episodes in a10×1010\\times 10grid world containingU\(0,20\)U\(0,20\)obstacles and88uniquely colored and shaped objects\. To ensure task complexity, generated episodes are filtered to guarantee sufficient trajectory length and goal ambiguity\. The resulting dataset comprises both rendered visual observations and detailed textual descriptions of the environment rules\.

All environments and agents accept explicit seeds\. We store environment configurations, initial states, and full action histories to reproduce any episode or visualization\.

### B\.2Data Generation

#### Question Answering

We formulate the QA task using binary\-choice questions with grounded natural language descriptions\. For each episode, we generate three distinct types of queries to test different aspects of social reasoning:

- •Type 1 & 2 \(Pre\-Pick\):Sampled at timesteps before the human picks up an object\. These questions query the model’s ability to infer the intended object to be picked \(given the placement goal\) or the overall goal configuration\.
- •Type 3 \(Post\-Pick\):Sampled at timesteps after the human is holding an object\. These questions query the intended placement target given the currently held object\.

We utilize 800 episodes \(2,400 questions\) for training and 100 episodes \(300 questions\) for evaluation\.

#### Proactive Assistance

For proactive assistance, the model is required to propose a full probability distribution over theNNcandidate goal pairs at each timestep, enabling real\-time intent inference without explicit questioning\. We useN=2N=2in the experiments\. To enhance visual grounding and standardize the goal representations, we impose a strict structural constraint on our model’s output\. Specifically, the model is instructed that the human agent consistently prioritizes interacting with the nearest object first\. Consequently, within each predicted goal hypothesis, the objects must be strictly ordered based on their initial proximity to the human’s starting position \(i\.e\., the closer object is explicitly designated as the first object, and the further one as the second\)\. This structured output formulation provides a stronger spatial inductive bias compared to the unconstrained inference prompts used for the pretrained baselines\.

We employ 1000 unlabeled episodes, unrolled into individual timesteps, for training the stepwise inference model\. Evaluation is performed on a separate set of 20 randomly sampled episodes to assess online assistance performance\.

### B\.3Agent Policies

#### Helping Planner

The helper assists the human by maintaining a goal distributionB=\{\(gi,pi\)\}B=\\\{\(g\_\{i\},p\_\{i\}\)\\\}over paired goals\. It selects actions using a Boltzmann policy based on the probability\-weighted expected return:Q\(a\)=∑ipi⋅V\(a∣gi\)Q\(a\)=\\sum\_\{i\}p\_\{i\}\\cdot V\(a\\mid g\_\{i\}\)\. The policy is designed to be complementary: it predicts which target the human will prioritize \(typically the closer one\) and aims for the other\. To ensure smooth collaboration, the helper follows heuristic rules to yield to the human, avoid blocking paths, and prevent deadlocks\.

#### Simulated Human Planner

The human agent employs a goal\-directed planner based on shortest\-path distances, operating sequentially by acquiring the proximal target and transporting it to a position adjacent to the distal target\. Actions are sampled via a Boltzmann policy with temperatureτ=0\.01\\tau=0\.01, subject to logical constraints \(e\.g\., mandatory object interactions\)\. To simulate physical load constraints in the proactive assistance task, the human adheres to an alternating “move\-then\-pause” pattern when carrying an object\. Furthermore, to mimic realistic human stochasticity and enhance trajectory diversity, we introduce a randomness factor of0\.150\.15during evaluation, where the agent takes a random action with15%15\\%probability\. To account for the stochasticity of our helping planner and simulated human planner, we evaluate the GridWorld proactive assistance task across three random seeds \(10, 20, and 30\) and report the averaged results\.

## Appendix CHousehold Experiments

We provide the experimental details of our Household Question Answering \(Section[5\.3](https://arxiv.org/html/2606.00240#S5.SS3)\) and Proactive Assistance \(Section[5\.4](https://arxiv.org/html/2606.00240#S5.SS4)\) experiments\.

### C\.1Environment Setup

We useVirualHome\(Puiget al\.,[2021](https://arxiv.org/html/2606.00240#bib.bib14)\)v2\.2\.4 as household simulator, where agent policies are implemented by a goal\-conditioned MCTS planner\. For online goal inference, following AutoToM\(Zhanget al\.,[2025](https://arxiv.org/html/2606.00240#bib.bib35)\), we use Sequential Monte Carlo algorithm to maintain the goal hypotheses over time\.

### C\.2Data Generation

#### Question Answering

We use the MMToM\-QA\(Jinet al\.,[2024](https://arxiv.org/html/2606.00240#bib.bib36)\)training set to construct training data forMindZero\. Since the test questions use binary choices, valid hypotheses may often lie outside the provided candidate set\. To better match this format, we apply hypothesis filtering to construct binary options instead of sampling from the full hypothesis space\. For goal\-related questions, we form choices by pairing a randomly sampled observed object with an unobserved one\. For belief\-related questions, we sample an unobserved object–container pair to create a binary verification task\. Applying this filtering strategy to the 953 training episodes yields a final dataset of 4,866 examples\.

#### Proactive Assistance

Following the standard setting ofVirualHome\(Puiget al\.,[2021](https://arxiv.org/html/2606.00240#bib.bib14)\), we use Apartment \#0, \#1, \#2, \#3, and \#5 for training data generation, and Apartment \#3 and \#6 for testing data generation\. We generate 20 episodes \(968 timesteps\) for training and 16 for testing, evenly distributed across four task types: setting up a table, loading the fridge, preparing food, and loading the dishwasher\.

## Appendix DHuman Experiment

We recruited 12 Johns Hopkins University students, including undergraduate, master’s, and Ph\.D\. students\. The pool included 5 male and 7 female participants\. All participants were at least 18 years old and able to operate a computer interface\. The study was approved by the Institutional Review Board \(IRB\)\. Prior to participation, each participant reviewed and signed an informed consent form\. Participation was voluntary, and participants could withdraw from the study at any time\.

Each study session took approximately 60 minutes\. Participants completed household tasks in a simulated apartment environment using a computer interface, as shown in Figure[10](https://arxiv.org/html/2606.00240#A4.F10)\. During the task, the system recorded task\-related interaction logs\.

![Refer to caption](https://arxiv.org/html/2606.00240v1/figures/human_experiment_interface.png)Figure 10:Human experiment interface for the Household Proactive Assistance domain\. The header reports the task, step budget, and episode; the left panel lets the participant navigate rooms and shows holding status, goal progress, and the helper agent’s state\. The center renders the agent’s view of the current room with an inset household map, and the right panel lists all visible objects with their spatial relations and open/closed states alongside the contextual action for the selected object\.While Section[6\.4](https://arxiv.org/html/2606.00240#S6.SS4)reports speedup averaged across tasks, we present per\-task results in Table[3](https://arxiv.org/html/2606.00240#A4.T3)\. Across all four tasks, MindZero trained from Qwen3\-4B yields a positive speedup, whereas the pretrained Qwen3\-4B model produces a negative speedup on Tasks 5 and 13, indicating that the same backbone without our training may even slow the human down\. The per\-task gap betweenMindZeroand Gemini\-3\-Flash is small and varies in sign, consistent with the absence of a statistically significant difference between the two on aggregate speedup\.

Table 3:Human experiment results in the Household Proactive Assistance domain\. We report average task\-completion steps for each condition and the corresponding speedup over the Single Human setting\. We useMindZerow/ Qwen3\-4B as the base model\.Task IDAverage StepsSpeedup \(%\)Qwen3\-4BMindZeroGemini\-3\-FlashSingle HumanQwen3\-4BMindZeroGemini\-3\-Flash35651507023\.6737\.5038\.41558443947\-18\.507\.6321\.558434144479\.2315\.457\.58131199791114\-3\.9218\.2826\.10Average––––2\.6219\.7023\.40Standard Error––––9\.006\.306\.40
## Appendix ETest\-Time Scaling Methods

### E\.1ThoughtTracing

For the Household Question Answering task, we evaluateThoughtTracingusing the original implementation, without any modifications to the codebase, including the prompts\. In contrast to the evaluation protocol reported in the original work, we conduct our testing on the complete, unmodified set of 600 test instances to ensure a fair comparison with other baselines and our main experiments\. For the GridWorld Question Answering task, which was not explored in the original work, we introduce only the necessary environment\-specific modifications\. AsThoughtTracingdoes not support direct visual input, we augment each question with explicit coordinate representations alongside an ASCII map of the environment\. This adaptation ensures that all essential visual information required for reasoning is preserved\.

### E\.2AutoToM

We evaluateAutoToMacross multiple backend models using the original implementation, without any modifications to the codebase, including the prompts\. Due to the limited instruction\-following capabilities of smaller models \(e\.g\., Llama\-3\.2\-3B\), parsing errors may occur\. When such errors arise, we adopt a uniform distribution as the inference result ofAutoToMto ensure a fair comparison\.

### E\.3Textual Transcripts

Specifically, for GridWorld Question Answering, as bothThoughtTracingandAutoToMdo not support multimodal inputs, we use textual transcripts to evaluate the performance\. See an example in Figure[11](https://arxiv.org/html/2606.00240#A5.F11)\.

YouareahelperagentinaGridWorldenvironment\.Youaretheredrobot,andtheHumanisthegreenrobot\.Therearemultipleobjects:brownstar,orangestar,yellowstar,pinkstar,greenstar,redstar,purplestar,andbluestar\.TheHuman’sgoalistoplacetwooftheobjectsnexttoeachother\.TheHumancanmoveup,down,left,right,orstay,andcanpickupanobjectwhenstandingonitandnotholdingone,andcanputdownanobjectwhenholdingoneandthecellisempty\.TheHuman’sactiontrajectorysofarisshownintheimage\.

Stateandtrajectorydetails:

Agents:

\-Humanpos:\(6,1\)

\-Helperpos:\(0,0\)

Obstacles:

\[\]

Objects\(bylabel\):

\{’brownstar’:\(5,9\),’orangestar’:\(8,2\),’yellowstar’:\(4,0\),’pinkstar’:\(9,1\),’greenstar’:\(5,5\),’redstar’:\(3,3\),’purplestar’:\(3,7\),’bluestar’:\(2,8\)\}

Actiondeltas\(dx,dy\):

\{’up’:\(0,1\),’down’:\(0,\-1\),’left’:\(\-1,0\),’right’:\(1,0\),’stay’:\(0,0\),’pick’:\(0,0\),’put’:\(0,0\)\}

Actiontrajectory\(human,name\+delta\):

t=1:left\(\-1,0\);t=2:down\(0,\-1\);t=3:down\(0,\-1\);t=4:left\(\-1,0\)

Actiontrajectory\(humanpositions\):

t=1:\(7,3\);t=2:\(7,2\);t=3:\(7,1\);t=4:\(6,1\)

ASCIIstate:

Step4

\.\.\.\.\.0\.\.\.\.

\.\.7\.\.\.\.\.\.\.

\.\.\.6\.\.\.\.\.\.

\.\.\.\.\.\.\.\.\.\.

\.\.\.\.\.4\.\.\.\.

\.\.\.\.\.\.\.\.\.\.

\.\.\.5\.\.\.\.\.\.

\.\.\.\.\.\.\.\.1\.

\.\.\.\.\.\.H\.\.3

P\.\.\.2\.\.\.\.\.

GiventhattheHumanintendstoplaceanobjectnexttothepurplestarat\(3,7\),whichobjectistheHumanmorelikelytopickupnext?\(a\)yellowstarat\(4,0\)\.\(b\)redstarat\(3,3\)\.

Figure 11:An example of textual transcript for GridWorld Question Answering\.

## Appendix FFull Results of Question Answering

While Figure[4](https://arxiv.org/html/2606.00240#S5.F4)provides an overview of the results forMindZeroand the baselines, we present the full results for our question answering experiments across two domains in Table[4](https://arxiv.org/html/2606.00240#A6.T4)below\.

Table 4:Full question answering results ofMindZeroand baselines on \(a\) GridWorld and \(b\) Household domains\. Best results overall and among open\-weight models are shown inboldandunderlined\. \* indicates methods with text\-only inputs\.\(a\)Gridworld Question AnsweringMethodAccuracy↑\\uparrowTFLOPs↓\\downarrowQwen3\-VL\-4B37\.73\.6Qwen3\-VL\-4B\-Think42\.767\.1Qwen3\-VL\-8B43\.37\.2Qwen3\-VL\-8B\-Think44\.7110\.9Qwen3\-VL\-235B\-A22B39\.321\.9Qwen3\-VL\-235B\-A22B\-Think44\.31767\.5GPT\-5\.250\.7ProprietaryGPT\-5\.2\-Think50\.7ProprietaryGemini\-3\-Flash68\.0ProprietaryGemini\-3\-Pro83\.7ProprietaryThoughtTracing\*\(Kimet al\.,[2025](https://arxiv.org/html/2606.00240#bib.bib38)\)w/ Qwen3\-VL\-4B50\.331\.0w/ Qwen3\-VL\-8B56\.754\.3w/ Qwen3\-VL\-235B\-A22B53\.0169\.8w/ GPT\-5\.257\.3Proprietaryw/ Gemini\-3\-Flash64\.0ProprietaryAutoToM\*\(Zhanget al\.,[2025](https://arxiv.org/html/2606.00240#bib.bib35)\)w/ Qwen3\-VL\-4B49\.3344\.4w/ Qwen3\-VL\-8B52\.3741\.2w/ Qwen3\-VL\-235B\-A22B44\.71089\.7w/ GPT\-5\.257\.3Proprietaryw/ Gemini\-3\-Flash47\.0ProprietaryMindZero\(Ours\)w/ Qwen3\-VL\-4B95\.03\.6w/ Qwen3\-VL\-8B92\.37\.2
\(b\)Household Question AnsweringMethodAccuracy↑\\uparrowTFLOPs↓\\downarrowLlama\-3\.1\-8B41\.312\.9Llama\-3\.2\-3B34\.84\.0Qwen3\-4B42\.810\.9Qwen3\-4B\-Think45\.041\.3Qwen3\-235B\-A22B54\.580\.4Qwen3\-235B\-A22B\-Think54\.02663\.0GPT\-5\.265\.0ProprietaryGPT\-5\.2\-Think73\.5ProprietaryGemini\-3\-Flash67\.2ProprietaryGemini\-3\-Pro60\.8ProprietaryThoughtTracing\(Kimet al\.,[2025](https://arxiv.org/html/2606.00240#bib.bib38)\)w/ Llama\-3\.1\-8B44\.3571\.7w/ Llama\-3\.2\-3B43\.5232\.9w/ Qwen3\-4B54\.5291\.2w/ Qwen3\-235B\-A22B59\.82097\.9w/ GPT\-5\.268\.0Proprietaryw/ Gemini\-3\-Flash72\.3ProprietaryAutoToM\(Zhanget al\.,[2025](https://arxiv.org/html/2606.00240#bib.bib35)\)w/ Llama\-3\.1\-8B54\.0136\.3w/ Llama\-3\.2\-3B51\.023\.4w/ Qwen3\-4B54\.7177\.5w/ Qwen3\-235B\-A22B67\.5389\.9w/ GPT\-5\.276\.5Proprietaryw/ Gemini\-3\-Flash80\.2ProprietaryMindZero\(Ours\)w/ Llama\-3\.1\-8B76\.212\.9w/ Llama\-3\.2\-3B77\.84\.4w/ Qwen3\-4B72\.713\.1
MindZero: Learning Online Mental Reasoning With Zero Annotations

Similar Articles

RemoteZero: Geospatial Reasoning with Zero Human Annotations

TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

Ring-Zero: Scaling Zero RL to a Trillion Parameters for Emergent Reasoning

G-Zero: Self-Play for Open-Ended Generation from Zero Data

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Submit Feedback

Similar Articles

RemoteZero: Geospatial Reasoning with Zero Human Annotations
TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
Ring-Zero: Scaling Zero RL to a Trillion Parameters for Emergent Reasoning
G-Zero: Self-Play for Open-Ended Generation from Zero Data
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision