Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Summary
This paper introduces Vigil, an evaluation framework for embodied agents that disentangles task execution success from the agent's ability to correctly recognize and report task completion.
# Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents
Source: [https://arxiv.org/html/2605.08747](https://arxiv.org/html/2605.08747)
Ying Chen¹, Rui Jiang¹, Lihuang Fang¹, Mingxu Wang¹, Zhifeng Gu¹,², Lei Yi¹, Jie Chen¹†
¹XPENG Robotics, ²The Hong Kong Polytechnic University
###### Abstract
Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call *terminal commitment*. Behaviorally distinct failures (never completing the task, completing it but failing to stop, and reporting success without sufficient evidence) collapse into the same benchmark failure. We introduce Vigil, an evaluation framework that makes terminal commitment independently measurable. Under Vigil's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion ($W$) and benchmark success ($B$), where $B$ additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable $W$ differ by up to 19.7 pp in $B$: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve $W$ broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. Vigil provides a protocol that makes terminal commitment independently visible and scorable.
† Corresponding author. ‡ The benchmark and code will be released.

## 1 Introduction
Embodied agents must not only take actions to complete a task, but also determine when the task has been completed and commit to that judgment\. This is non\-trivial when agents operate under partial observability and must infer task progress from limited egocentric observations over time\[[1](https://arxiv.org/html/2605.08747#bib.bib1),[2](https://arxiv.org/html/2605.08747#bib.bib2)\], without action feedback or success signals\[[3](https://arxiv.org/html/2605.08747#bib.bib3),[4](https://arxiv.org/html/2605.08747#bib.bib4)\]\. For example, in a task that asks the agent to turn on a lamp, the agent may successfully turn it on yet continue navigating because it fails to recognize that the task is already complete\. Under current embodied evaluation, this outcome is often indistinguishable from never having completed the task\.
Figure 1: Controlled evaluation protocol. Vigil contains eight task families: a diagnostic tier (PG, DA, SV, VS) that isolates a single bottleneck, and a compositional tier (AI, SI, SM, CR) that combines them in multi-step interaction. All episodes use strict first-person observation and mandatory `report` termination.

To our knowledge, no existing embodied benchmark cleanly separates task-closure failures from execution failures [[5](https://arxiv.org/html/2605.08747#bib.bib5), [6](https://arxiv.org/html/2605.08747#bib.bib6), [7](https://arxiv.org/html/2605.08747#bib.bib7)]. Recent work has improved per-skill and capability-level diagnostics [[8](https://arxiv.org/html/2605.08747#bib.bib8), [9](https://arxiv.org/html/2605.08747#bib.bib9), [4](https://arxiv.org/html/2605.08747#bib.bib4), [10](https://arxiv.org/html/2605.08747#bib.bib10)], yet overall model comparison is still summarized by aggregated task- or episode-level success metrics. These metrics collapse behaviorally distinct cases into the same final outcome: the agent may fail to complete the task, complete it but fail to stop, or declare completion without sufficient evidence. These failures have different causes and remedies, yet current metrics conflate them.
We introduce Vigil, an evaluation framework that makes *terminal commitment* independently measurable. The design has three key elements. First, agents observe only egocentric RGB, without privileged state, oracle progress signals, or action-success confirmation. Second, each episode ends with a mandatory semantic `report` whose content is checked deterministically against the hidden world state. Third, this yields two separate scores: world-state completion ($W$) and benchmark success ($B$), where $B$ additionally requires a correct terminal commitment. The no-feedback contract is essential: if agents received action-success signals, correct reports could be produced by echoing environment confirmation rather than by maintaining task-state judgment from observation.
This target differs from prior work on self-assessment and uncertainty in language agents [[11](https://arxiv.org/html/2605.08747#bib.bib11), [12](https://arxiv.org/html/2605.08747#bib.bib12), [13](https://arxiv.org/html/2605.08747#bib.bib13)]. Rather than inferring latent confidence or internal belief [[11](https://arxiv.org/html/2605.08747#bib.bib11)], we evaluate whether the agent expresses the correct task-state judgment at episode closure through a terminal report that can be checked against the hidden world state. For state-verification tasks, where the agent must report `open`/`closed` or `on`/`off`, the correct terminal act is a categorical state judgment, not a binary decision to halt. Consequently, a stop-only protocol cannot represent this class of failure, because the error lies in report content rather than in termination timing.
Vigil evaluates this dimension over eight task families that probe the conditions under which terminal judgment becomes difficult (Figure [1](https://arxiv.org/html/2605.08747#S1.F1)), spanning target visibility, distance, state uncertainty, temporal dependency, and physical constraint. A diagnostic tier (short budgets, single bottlenecks) isolates closure failures when execution is tractable, while a compositional tier (longer budgets, chained prerequisites) reveals when execution floors mask closure errors. We additionally use an action-feedback intervention, modeled on proprioceptive signals available to physical robots, to test whether closure failures can be explained by upstream execution traps alone.
#### Findings\.
Across 20 multimodal systems spanning open\-weight and closed\-source frontier models \(10 anchor models in the main text; full panel in Appendix[F](https://arxiv.org/html/2605.08747#A6)\), we find that execution and terminal commitment are empirically separable:
- Structured closure-failure profiles. Models exhibit distinct terminal behaviors, including premature false commitments (FR), chronic no-report exhaustion (NR), and selective reporting. These profiles are invisible under aggregate success and stable across prompt variants.
- Execution floors mask closure failures. On longer compositional tasks, execution often fails before terminal judgment can be meaningfully evaluated, compressing the observable gap and masking closure errors that the diagnostic tier isolates directly.
- Execution feedback is not a universal fix. A proprioceptive action-feedback intervention reduces execution traps broadly, but improves terminal reporting only for models whose terminal reports are already coupled to achieved task state.
Together, these results identify terminal commitment as a distinct failure dimension: it is empirically separable from execution, produces consistent step-level patterns, and is not uniformly repaired by improving execution alone. To the best of our knowledge, Vigil provides the first evaluation protocol that makes this dimension independently measurable, enabling targeted diagnosis and comparison of terminal judgment across embodied systems.
## 2 Related Work
Vigil intersects three lines of work: embodied evaluation, self-assessment and confidence-related control, and belief-oriented embodied reasoning. Its scientific target is distinct: whether the agent can correctly judge and report its achieved task state at episode closure, scored independently of execution success. Table [1](https://arxiv.org/html/2605.08747#S2.T1) summarizes how existing settings compare along this axis.
Table 1: Representative embodied evaluation settings and whether they make terminal task-state judgment externally scorable. Vigil makes agent-side terminal commitment independently scorable under a no-feedback native-control contract. Column definitions: *Native*: the agent acts through natural skill calls rather than privileged API commands. *Fine-Grained*: per-skill diagnostics beyond aggregate success. *Active Termination*: "Stop" = bare stop action absorbed into task success; "Report" = semantic terminal judgment independently scored; ✗ = evaluator-side termination only. *Decoupled*: "Task" = per-skill decomposition without independent report scoring; "Task+Report" = adds deterministically scored terminal commitment. *No Feedback*: no external confirmation of task progress, goal completion, or action success.
#### Embodied Evaluation Protocols\.
Household instruction\-following benchmarks define long\-horizon tasks evaluated by terminal success\[[5](https://arxiv.org/html/2605.08747#bib.bib5),[14](https://arxiv.org/html/2605.08747#bib.bib14),[17](https://arxiv.org/html/2605.08747#bib.bib17),[18](https://arxiv.org/html/2605.08747#bib.bib18)\]\. Subsequent work extends the setting across simulation fidelity\[[16](https://arxiv.org/html/2605.08747#bib.bib16)\], LLM\-based planning\[[6](https://arxiv.org/html/2605.08747#bib.bib6)\], navigation\[[7](https://arxiv.org/html/2605.08747#bib.bib7)\], and compositional manipulation\[[15](https://arxiv.org/html/2605.08747#bib.bib15)\]\. Recent suites add per\-skill diagnostics\[[8](https://arxiv.org/html/2605.08747#bib.bib8),[4](https://arxiv.org/html/2605.08747#bib.bib4),[9](https://arxiv.org/html/2605.08747#bib.bib9),[10](https://arxiv.org/html/2605.08747#bib.bib10)\], improving attribution across perception, navigation, and manipulation\. The key gap remains: terminal success is an evaluator\-side predicate over world state, not an agent\-side judgment that is independently scored\. Execution failures and closure failures are therefore behaviorally conflated\.
#### Self\-Assessment, Confidence, and Termination Decisions\.
Several benchmarks include a `stop` action [[5](https://arxiv.org/html/2605.08747#bib.bib5), [6](https://arxiv.org/html/2605.08747#bib.bib6), [7](https://arxiv.org/html/2605.08747#bib.bib7)], but stopping is a control primitive absorbed into task success without independent evaluation. A parallel literature asks whether language models possess calibrated self-knowledge [[11](https://arxiv.org/html/2605.08747#bib.bib11)] and whether confidence signals can trigger help-seeking or replanning [[12](https://arxiv.org/html/2605.08747#bib.bib12), [13](https://arxiv.org/html/2605.08747#bib.bib13)]. Vigil targets a different observable: not latent confidence, but whether the agent produces a semantically correct terminal report that can be verified against hidden world state.
#### Belief Under Embodied Interaction\.
Spatial intelligence in vision-language models is increasingly studied beyond static images, spanning metric reasoning [[19](https://arxiv.org/html/2605.08747#bib.bib19), [20](https://arxiv.org/html/2605.08747#bib.bib20)], perspective-taking [[21](https://arxiv.org/html/2605.08747#bib.bib21), [22](https://arxiv.org/html/2605.08747#bib.bib22), [23](https://arxiv.org/html/2605.08747#bib.bib23)], and robotics-oriented grounding [[24](https://arxiv.org/html/2605.08747#bib.bib24), [25](https://arxiv.org/html/2605.08747#bib.bib25), [26](https://arxiv.org/html/2605.08747#bib.bib26), [27](https://arxiv.org/html/2605.08747#bib.bib27)]. These settings show that embodied interaction requires building and updating representations from partial observations, a capacity that remains difficult for current models [[28](https://arxiv.org/html/2605.08747#bib.bib28), [29](https://arxiv.org/html/2605.08747#bib.bib29)]. Recent work makes this explicit: Theory of Space [[1](https://arxiv.org/html/2605.08747#bib.bib1)] asks whether foundation models construct spatial beliefs through active exploration; CubeBench [[2](https://arxiv.org/html/2605.08747#bib.bib2)] diagnoses interactive spatial reasoning under partial observation. Vigil shares this belief-under-interaction perspective but targets a different output: task-state judgment at closure, not spatial belief during exploration.
## 3 Benchmark Design
Vigil separates two outcomes that standard embodied evaluation conflates: whether the agent changed the world correctly, and whether it issued a correct terminal report about that change. This requires controlled task families with hidden goal predicates, a no-feedback interaction contract, and a deterministic scoring rule that evaluates world state and report content independently.
### 3\.1Task Families
Each frozen episode specifies a task instruction, a family label, step and invalid-action budgets, and a hidden success condition $\mathcal{G}$ checked against the simulator state. The benchmark contains 1,000 episodes across eight balanced families (125 each), organized as controlled probes over the factors that shape task-state judgment under interaction (Figure [1](https://arxiv.org/html/2605.08747#S1.F1)).
Rather than decomposing by domain\-specific subtasks—as in recent skill\-diagnostic benchmarks\[[4](https://arxiv.org/html/2605.08747#bib.bib4),[10](https://arxiv.org/html/2605.08747#bib.bib10)\]—we decompose along atomic perceptual\-motor capacities and progressively compose them\. This makes attribution precise: when a model fails on a composed task but succeeds on its constituents, the bottleneck lies in composition or terminal judgment, not in the constituent skills\.
The *diagnostic tier* isolates single bottlenecks with short budgets (5–20 steps):

- PG (pixel grounding): click the correct visible object.
- DA (distance approach): navigate to a visible target.
- VS (view search): find a non-visible target through active exploration.
- SV (state verification): report the categorical state of a visible object without physical interaction.

SV provides the purest test of terminal commitment: world-state completion exceeds 80% for most models, so failures in $B$ directly expose judgment errors.

The *compositional tier* chains these capacities under longer budgets (25–40 steps):

- AI (approach-and-interact): DA + interaction.
- SI (search-and-interact): VS + DA + interaction.
- SM (sequential manipulation): multi-step pick-and-place.
- CR (constraint resolving): obstacle removal before interaction.

Full family specifications are in Appendix [D](https://arxiv.org/html/2605.08747#A4).
### 3\.2Native\-Control Contract
Episodes run in AI2-THOR [[30](https://arxiv.org/html/2605.08747#bib.bib30)] with ProcTHOR [[31](https://arxiv.org/html/2605.08747#bib.bib31)] houses. The agent receives a single egocentric RGB frame per step and acts through four skills: locomotion (`navigate`, `look`), pixel-grounded object interaction (`interact_pixel`), and terminal reporting (`report`). No privileged state or action-success feedback is provided: the agent receives no maps, semantic masks, absolute pose, or oracle progress signals. State changes are visible only through subsequent RGB observations.
Following the dialogue-based native-control paradigm of recent embodied evaluation [[4](https://arxiv.org/html/2605.08747#bib.bib4), [10](https://arxiv.org/html/2605.08747#bib.bib10)], the agent maintains a bounded dialogue history of up to 20 turns (one turn per action step), including any agent-generated `thought` or `cognitive_state` fields, as the only cross-step memory. The 20-turn bound is chosen to match the step budgets of our longest compositional tasks (25–40 steps) while remaining within the reliable context-utilization range of current multimodal models.
This contract is intentionally sparse: terminal judgments must be formed from the agent’s own partial\-observation interaction history rather than from evaluator\-side success signals\.
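The bounded dialogue history described above can be illustrated with a short sketch. Only the 20-turn bound and the `thought` field name come from the text; the `Turn` and `BoundedHistory` classes and the oldest-first eviction policy are assumptions made for illustration, not the benchmark's released implementation.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_TURNS = 20  # bound stated in the text; everything else is illustrative


@dataclass
class Turn:
    """One dialogue turn: the agent's action plus an optional free-text field."""
    action: str
    thought: str = ""


@dataclass
class BoundedHistory:
    """Keeps only the most recent MAX_TURNS turns as cross-step memory."""
    turns: deque = field(default_factory=lambda: deque(maxlen=MAX_TURNS))

    def add(self, turn: Turn) -> None:
        # Assumed policy: deque(maxlen=...) drops the oldest turn when full.
        self.turns.append(turn)

    def context(self) -> list:
        return list(self.turns)


history = BoundedHistory()
for step in range(30):  # a 30-step episode, longer than the history bound
    history.add(Turn(action=f"navigate_{step}"))
```

Under this assumed policy, an agent in a 30-step episode no longer has its first ten turns in context at the end, which is one way an early goal attainment could fall out of memory before a report is issued.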
Figure 2: Evaluation pipeline. The top row illustrates an example trajectory. The bottom row summarizes the per-step interface: the agent acts from only the current egocentric RGB frame, the task instruction, and bounded dialogue history, without action-success or goal-completion feedback. A terminal `report` is evaluated against the hidden world-state condition by deterministic rules.

The key mechanism is the terminal `report`: it carries a categorical `status` (e.g., `success`, `fail`, `on`/`off`, `open`/`closed`) and a short summary. Unlike a bare `stop`, `report` records *what* the agent judges the task state to be at termination. This makes the terminal act externally scorable: the evaluator can check whether the reported judgment matches the hidden world state, rather than merely whether termination coincided with goal satisfaction. This matters most in state-verification tasks, where the correct terminal act is a categorical state judgment rather than a binary decision to end.
### 3.3 Disaggregated Outcomes
We separate two outcomes at episode closure: whether the task condition is satisfied in the world, and whether the agent’s terminal commitment correctly reflects that state\.
For trajectory $\tau$, let $s_T$ be the hidden terminal state, $\mathcal{G}_{\mathrm{sem}}$ the primary semantic task-goal predicate, $a_T$ the terminal action, and $\texttt{report}_T$ the report content. The report status is normalized to one of eight categorical values (`success`, `fail`, `on`, `off`, `open`, `closed`, `unsafe`, `invalid`) and matched against the hidden state by a deterministic rule. For goal-completion tasks, `success` must coincide with $W = 1$; for state-verification tasks, where $\mathcal{G}_{\mathrm{sem}}$ denotes the queried state predicate rather than a manipulation goal, the status must exactly match the hidden object state (e.g., `on` iff the target is toggled on). We define:

$$
\begin{aligned}
W(\tau) &= \mathbb{I}\left[s_T \models \mathcal{G}_{\mathrm{sem}}\right], \\
B(\tau) &= \mathbb{I}\left[s_T \models \mathcal{G}_{\mathrm{sem}} \;\land\; a_T = \texttt{report} \;\land\; \texttt{match}(\texttt{report}_T, s_T)\right].
\end{aligned}
$$

$W$ captures world-state completion at termination under the primary semantic task predicate. $B$ captures benchmark success, requiring not only the correct semantic world state but also a matching terminal commitment. If an episode ends by budget exhaustion or invalid-action limit before `report`, then $B = 0$ regardless of the world state. The gap $\Delta = W - B$ isolates achieved task states not converted into benchmark success through correct terminal commitment.
Importantly, $\Delta$ is a decomposition handle rather than a standalone reliability measure: a small gap may reflect either reliable state-coupled commitment or aggressive reporting that coincidentally aligns with low $W$. We therefore analyze $\Delta$ jointly with deterministic closure-failure labels. False reports (FR) occur when the report claims a status unsupported by the hidden state. No-report exhaustion (NR) occurs when the step budget expires without a `report`. Invalid-action-limit terminations (IL) occur when cumulative malformed actions exceed the family budget. Appendix [C.2](https://arxiv.org/html/2605.08747#A3.SS2) summarizes the reported labels used in the figures and tables.
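The scoring rule and closure labels defined above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Episode` record, its field names, the `ended_by` values, and the `"matched"` label are hypothetical, and the hidden-state lookup is reduced to a precomputed `expected_status`. The eight status values and the FR/NR/IL definitions follow the text.

```python
from dataclasses import dataclass
from typing import Optional

# The eight normalized status values listed in the text.
VALID_STATUSES = {"success", "fail", "on", "off", "open", "closed", "unsafe", "invalid"}


@dataclass
class Episode:  # hypothetical record, not the benchmark's schema
    goal_satisfied: bool          # whether s_T satisfies G_sem (hidden state)
    report_status: Optional[str]  # None if the episode ended without a report
    expected_status: str          # status that would match the hidden state
    ended_by: str                 # "report", "budget", or "invalid_limit"


def world_completion(ep: Episode) -> int:
    """W: 1 if the hidden terminal state satisfies the goal predicate."""
    return int(ep.goal_satisfied)


def benchmark_success(ep: Episode) -> int:
    """B: W = 1, the terminal action is a report, and the report matches."""
    reported = ep.ended_by == "report" and ep.report_status in VALID_STATUSES
    return int(ep.goal_satisfied and reported and ep.report_status == ep.expected_status)


def closure_label(ep: Episode) -> str:
    if ep.ended_by == "budget":
        return "NR"       # step budget expired without a report
    if ep.ended_by == "invalid_limit":
        return "IL"       # malformed actions exceeded the family budget
    if ep.report_status != ep.expected_status:
        return "FR"       # reported status unsupported by the hidden state
    return "matched"      # hypothetical label for a report that matches


drift = Episode(True, None, "success", "budget")
unsupported = Episode(False, "success", "fail", "report")
verified = Episode(True, "success", "success", "report")
```

Here `drift` scores $W = 1$, $B = 0$ with label NR (post-attainment drift), `unsupported` scores $W = 0$, $B = 0$ with label FR, and `verified` scores $W = B = 1$.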
#### Interpretation Note\.
Vigil does not claim that every failure labeled by $B = 0$ uniquely identifies an internal cognitive cause. Rather, it provides an externally testable decomposition of closure behavior: whether the agent achieved the relevant task state, whether it issued a terminal judgment, and whether that judgment matched the achieved world state. This decomposition allows closure failures to be compared systematically across models and interventions.
## 4 Experiments
All experiments use the native-control contract from §[3](https://arxiv.org/html/2605.08747#S3): egocentric RGB input, bounded dialogue history, no environment feedback, and mandatory `report` termination. Main-text world-state results ($W$) use the primary semantic goal predicate; dual-metric scoring details are in Appendix [C](https://arxiv.org/html/2605.08747#A3). We evaluate a frozen 1,000-episode set (eight families, 125 each) on 10 anchor models spanning closed-source frontier models (Gemini-3.1-Pro [[32](https://arxiv.org/html/2605.08747#bib.bib32)], Doubao-Seed-1.8 [[33](https://arxiv.org/html/2605.08747#bib.bib33)], GPT-5.4 [[34](https://arxiv.org/html/2605.08747#bib.bib34)], Claude-Sonnet-4 [[35](https://arxiv.org/html/2605.08747#bib.bib35)]), open general VLMs (Qwen3.6-27B [[36](https://arxiv.org/html/2605.08747#bib.bib36)] and Qwen3.5-27B [[37](https://arxiv.org/html/2605.08747#bib.bib37)], both served with thinking enabled; Qwen3-VL-32B [[38](https://arxiv.org/html/2605.08747#bib.bib38)] in Instruct and Thinking variants; and InternVL3.5-38B [[39](https://arxiv.org/html/2605.08747#bib.bib39)]), and embodied-tuned systems (MiMo-Embodied-7B [[40](https://arxiv.org/html/2605.08747#bib.bib40)]). This spread tests whether terminal commitment failures are specific to a model class or appear broadly across architectures, scales, and training objectives; the primary 20-model comparison panel is in Appendix [F](https://arxiv.org/html/2605.08747#A6).
The empirical goal is not only to compare benchmark success, but to characterize how embodied models fail at episode closure\. We ask three questions: do execution and terminal commitment separate empirically \(§[4\.1](https://arxiv.org/html/2605.08747#S4.SS1)\)? Do models exhibit distinct closure\-failure profiles with interpretable step\-level signatures \(§[4\.2](https://arxiv.org/html/2605.08747#S4.SS2)\)? Are these closure failures explained entirely by upstream execution traps \(§[4\.3](https://arxiv.org/html/2605.08747#S4.SS3)\)?
### 4.1 Cross-Model Outcomes: Execution and Terminal Commitment Separate
Table 2: Per-family outcomes for 10 anchor models under native control. Each cell reports $W$/$B$ (%), where $W$ = world-state completion and $B$ = benchmark success. $\Delta = W - B$ gap (pp). Diagnostic: PG = pixel grounding, DA = distance approach, VS = view search, SV = state verification. Compositional: AI = approach-and-interact, SI = search-and-interact, SM = sequential manipulation, CR = constraint resolving. Anchor set shared with all subsequent analyses; primary 20-model comparison panel in Appendix [F](https://arxiv.org/html/2605.08747#A6). (T) marks thinking serving mode; Qwen3.6-27B and Qwen3.5-27B are served with thinking enabled and not additionally marked. † Closed-source API models. Identifiers: `gemini-3.1-pro-preview`, `doubao-seed-1.8`, `gpt-5.4`, `claude-sonnet-4-20250514`.
Table [2](https://arxiv.org/html/2605.08747#S4.T2) summarizes per-family outcomes for the 10 anchor models used throughout all analyses.
#### Execution and terminal commitment separate empirically\.
Comparable levels of world-state completion can produce sharply different benchmark success. Doubao-Seed-1.8 and InternVL3.5-38B reach nearly identical $W$ (40.4% vs. 39.9%), yet differ by 19.7 pp in $B$ (37.1% vs. 17.4%); their sharply different closure gaps ($\Delta = 3.3$ vs. 22.5 pp) explain this benchmark-success divergence. A related decoupling appears between Qwen3.6-27B (46.7%/37.8%) and InternVL3.5-38B (39.9%/17.4%). The reverse also holds: Qwen3-VL-32B (T) has lower $W$ than its instruct counterpart (30.7% vs. 39.2%) but higher $B$ (26.4% vs. 18.9%), showing that higher execution does not guarantee higher benchmark success.
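The gap arithmetic behind this comparison can be checked directly. The sketch below uses only the $W$/$B$ percentages quoted in the paragraph above; the dictionary layout and the `gap` helper are illustrative.

```python
# (W, B) in percent, copied from the paragraph above.
reported = {
    "Doubao-Seed-1.8": (40.4, 37.1),
    "InternVL3.5-38B": (39.9, 17.4),
    "Qwen3-VL-32B (Instruct)": (39.2, 18.9),
    "Qwen3-VL-32B (T)": (30.7, 26.4),
}


def gap(w: float, b: float) -> float:
    """Closure gap Delta = W - B in percentage points, rounded to one decimal."""
    return round(w - b, 1)


gaps = {name: gap(w, b) for name, (w, b) in reported.items()}

# Difference in B between the two models with near-identical W.
b_difference = round(
    reported["Doubao-Seed-1.8"][1] - reported["InternVL3.5-38B"][1], 1
)
```

The computed gaps reproduce the 3.3 pp and 22.5 pp figures, and `b_difference` reproduces the 19.7 pp divergence in $B$.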
The factorial task structure enables further attribution: because diagnostic families isolate single capacities, we can localize where closure breaks down in composed tasks. Qwen3.6-27B performs well on PG (87.2%) and reasonably on DA (56.8%) individually, but on AI attains 54.4% $W$ and only 28.8% $B$, suggesting that the composed-task bottleneck is not merely constituent navigation or grounding but also conversion of successful interaction into terminal commitment. For open-weight configurations with paired robustness runs, prompt-sensitivity and repeated-run checks show that the main closure profiles are stable across prompt variants and reruns (Appendix [K](https://arxiv.org/html/2605.08747#A11)).
#### Compositional tasks reveal where execution floors mask closure failures\.
Large $W$–$B$ separations concentrate in diagnostic probes and the simplest compositional task (AI), where execution remains high enough for closure behavior to be observable. On deeper compositional tasks (SI, SM, CR), the gap shrinks because successful world-state attainment itself becomes rare: since $\Delta$ is bounded by $W$, low execution compresses the observable signature of closure failure. The compositional tier therefore shows when execution becomes the dominant bottleneck and masks the closure failures that shorter episodes isolate directly.
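The bound on the gap follows directly from the definitions in §3.3. As a short derivation, writing $\bar{W}$ and $\bar{B}$ for averages over $N$ episodes (notation introduced here for illustration):

```latex
\begin{align}
B(\tau) &= W(\tau)\cdot\mathbb{I}\!\left[a_T = \texttt{report} \,\land\, \texttt{match}(\texttt{report}_T, s_T)\right] \;\le\; W(\tau), \\
0 \;\le\; \Delta &= \frac{1}{N}\sum_{i=1}^{N}\bigl(W(\tau_i) - B(\tau_i)\bigr) \;\le\; \frac{1}{N}\sum_{i=1}^{N} W(\tau_i) = \bar{W}.
\end{align}
```

A family on which $\bar{W}$ is, say, 10% can therefore show a gap of at most 10 pp, however unreliable the model's closure behavior is.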
### 4.2 Terminal Commitment: Structured Failures and Step-Level Evidence
We now decompose terminal failure modes and examine their step\-level signatures to determine whether closure failures have interpretable behavioral structure or are merely stochastic noise\.
Figure 3: Episode outcome partition for 10 anchor models (sorted by $B$). FR-heavy profiles (Claude-Sonnet-4, GPT-5.4) and NR-heavy profiles (InternVL3.5-38B, Qwen3-VL-32B) are immediately distinguishable despite comparable aggregate success. Label definitions are in Appendix [C.2](https://arxiv.org/html/2605.08747#A3.SS2); counterfactual report-policy analysis is in Appendix [H](https://arxiv.org/html/2605.08747#A8).

#### False commitments and missed commitments define distinct closure regimes.
Figure [3](https://arxiv.org/html/2605.08747#S4.F3) shows that terminal failure takes two dominant forms: unsupported commitment (FR: the agent issues a terminal report whose status is contradicted by the hidden world state) and missed terminal commitment (NR: the agent exhausts the budget without issuing a report). Within $W = 1$ episodes, NR corresponds to post-attainment drift. FR-heavy models (GPT-5.4 at 54.0%, Claude-Sonnet-4 at 69.9%) frequently issue terminal judgments unsupported by the achieved world state. NR-heavy models (InternVL3.5-38B at 66.1%, Qwen3-VL-32B at 44.6%, MiMo-Embodied-7B at 43.1%) often exhaust the budget without a terminal report; among episodes where the world goal is reached, this appears as failure to convert task-state attainment into terminal commitment. Qwen3-VL-32B combines both modes, with substantial FR and NR, producing a $\Delta$ of 20.3 pp despite nontrivial $W$, demonstrating that the two failure regimes are not mutually exclusive.
#### State verification makes semantic terminal judgment indispensable\.
SV episodes provide the clearest test of report content because the object state is already set at episode start and no physical interaction is required beyond observation: $W$ exceeds 80% for 9 of 10 anchors (Table [2](https://arxiv.org/html/2605.08747#S4.T2)). Yet $B$ drops by 5–31 pp, with the largest gaps in Claude-Sonnet-4 (73.6 → 45.6), GPT-5.4 (92.8 → 72.0), and InternVL3.5-38B (86.4 → 55.2). Because the correct terminal output is categorical (`on`/`off`, `open`/`closed`), a stop-only protocol can observe whether the agent terminates but not whether its terminal state judgment is semantically correct; semantic terminal evaluation is required to capture this class of embodied judgment error.
Figure 4: Terminal-commitment profiles (sorted by $B$ descending). (a) Mean belief lag: steps between first world-goal satisfaction and the correct terminal report. When agents do report correctly, they do so within 0.9–1.9 steps of that event (panel (a) rounds each model to one decimal place). (b) Among $W = 1$ episodes, percentages are the fraction of each primitive action type among all steps *after* the world goal is first satisfied (episode-level action counts are pooled; bars sum to 100%). NR-heavy models keep issuing navigation and pixel-interaction commands rather than closing, whereas low-$\Delta$ models concentrate on `report`.
#### A single scalar cannot characterize closure reliability\.
Gemini-3.1-Pro and Claude-Sonnet-4 have similarly small gaps ($\Delta \approx 1$–$4$ pp; Table [2](https://arxiv.org/html/2605.08747#S4.T2)), yet their closure regimes differ sharply (Figure [3](https://arxiv.org/html/2605.08747#S4.F3)): Gemini combines high $W$ (57.7%) with moderate FR (24.6%), whereas Claude-Sonnet-4 reaches much lower $W$ (25.2%) but reports aggressively (FR = 69.9%). Claude-Sonnet-4's small gap is not evidence of reliable closure; it reflects frequent terminal commitment despite weak state support, where low $W$ limits the number of achieved states available to expose missed conversions. We therefore interpret closure behavior through the joint pattern over $W$, $B$, FR, NR, and IL rather than any single scalar.
#### When correct, closure follows task\-state attainment quickly\.
Figure [4](https://arxiv.org/html/2605.08747#S4.F4)a shows a closure-latency proxy: the number of steps between first world-goal satisfaction and the terminal `report`, for episodes ending in a correct report. Within this correctly closed subset, the lag ranges from 0.9 to 1.9 steps across all 10 anchor models. Correct closure, when it occurs at all, is therefore prompt; the main failures are missed or incorrect closure rather than slow eventual reporting. Exact counts are in Appendix [I](https://arxiv.org/html/2605.08747#A9).
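A closure-latency proxy of this kind could be computed roughly as follows. This is a sketch under assumptions: the per-step goal trace, the zero-based step indexing, and the function names are hypothetical, and the paper's exact counting convention may differ.

```python
from typing import Optional, Sequence


def closure_lag(goal_met_per_step: Sequence[bool], report_step: int) -> Optional[int]:
    """Steps between first goal satisfaction and the terminal report.

    goal_met_per_step[t] is True if the hidden goal holds after step t
    (assumed convention). Returns None if the goal was never met at or
    before the report step.
    """
    for t, met in enumerate(goal_met_per_step[: report_step + 1]):
        if met:
            return report_step - t
    return None


def mean_lag(lags: Sequence[Optional[int]]) -> Optional[float]:
    """Mean over episodes with a defined lag; None if there are none."""
    valid = [lag for lag in lags if lag is not None]
    return sum(valid) / len(valid) if valid else None


# Goal first holds after step 2; the agent reports at step 3.
example = closure_lag([False, False, True, True], report_step=3)
```

In the example the lag is one step. Restricting the mean to correctly closed episodes, as the text does, keeps missed and incorrect closures out of the latency figure.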
#### Most false\-success commitments are unsupported by task progress\.
To characterize premature success claims at the step level, we examine the world-state progress recorded by the per-step evaluator at the moment of each false-success report. Across all 10 anchor models, 65–88% of false-success reports occur at exactly zero task progress: the agent has not navigated closer to the target, changed any task-relevant object state, or otherwise advanced the world toward the goal (Table [13](https://arxiv.org/html/2605.08747#A9.T13)). These are unsupported terminal commitments issued in the absence of task-relevant world-state change. The remaining false-success reports occur mostly at intermediate progress, typically after partial navigation or failed interaction; false-success commitments at high progress (> 0.75) are rare or absent across all models. The prevalence of zero-progress false-success reports is consistent with two interpretations: the model may genuinely misjudge task state, or it may treat the report as a default exit action regardless of task-state evidence. Vigil's behavioral interface cannot disambiguate these causes, but either mechanism yields closure failures that aggregate success would mask.
#### NR\-heavy models drift after task\-state attainment instead of closing\.
For the NR failure mode, we examine what agents do after the world condition is already satisfied \(Figure [4](https://arxiv.org/html/2605.08747#S4.F4)b\)\. InternVL3\.5\-38B has 201 NR episodes among its 399 world\-goal\-met episodes; the majority of its post\-completion actions are navigation commands, with only a small fraction being report\. A similar pattern holds for Qwen3\-VL\-32B and MiMo\-Embodied\-7B\. By contrast, Gemini\-3\.1\-Pro issues report as the dominant post\-completion action, consistent with its low NR count \(4 episodes\) and small Δ\. This suggests that NR is not primarily a timing artifact near the budget boundary; it is post\-attainment drift in which achieved task state fails to trigger closure\.
### 4\.3Do Closure Failures Persist After Partial Execution Improvement?
The preceding analysis identifies distinct terminal\-failure profiles under the no\-feedback contract\. A natural question is whether these profiles are downstream consequences of recurrent execution traps \(e\.g\., path\_blocked, too\_far\) that consume the step budget before reliable closure can occur\. To test this, we add a minimal action\-feedback intervention: two booleans after each action, too\_far \(interaction beyond the proximity threshold\) and path\_blocked \(navigation obstructed\)\. These signals operationalize proprioceptive and low\-level controller feedback in physical robots \(e\.g\., out\-of\-reach manipulation and blocked motion\)\. Crucially, they expose execution outcomes only, not goal completion, task progress, or whether a report should be issued\. If closure failures persist, they are not fully explained by execution traps alone\. Table [3](https://arxiv.org/html/2605.08747#S4.T3) reports paired runs for all 10 anchor models\.
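The information boundary of this intervention can be illustrated as a filter over the simulator's step result. The sketch is hypothetical: `build_feedback` and the shape of `action_result` are assumptions, and only the two field names `too_far` and `path_blocked` come from the text.

```python
from typing import Any, Dict

# The only fields the intervention is described as exposing.
FEEDBACK_FIELDS = ("too_far", "path_blocked")


def build_feedback(action_result: Dict[str, Any], feedback_enabled: bool) -> Dict[str, bool]:
    """Project a (hypothetical) simulator step result onto the public feedback fields.

    Anything else in action_result, such as goal or progress flags, is dropped,
    so execution outcomes are exposed but completion is not.
    """
    if not feedback_enabled:
        return {}  # base contract: no action-success signals at all
    return {name: bool(action_result.get(name, False)) for name in FEEDBACK_FIELDS}
```

Under this sketch the base contract and the +FB contract differ only in the `feedback_enabled` flag, which keeps paired runs comparable.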
Table 3: Diagnostic intervention: action feedback modeled on proprioceptive signals\. Base = no\-feedback contract; \+FB = adds too\_far and path\_blocked boolean signals\. Event counts are mean per\-episode values; ΔW/ΔB/ΔFR/ΔNR in pp\.
#### Execution improvement does not propagate uniformly to closure\.
Nine of ten models reduce path\_blocked events under feedback, and most reduce too\_far, confirming that the signal is consumed\. The effect on closure, however, varies across models\. Gemini\-3\.1\-Pro, Doubao\-Seed\-1\.8, and Qwen3\.5\-27B, and partially Qwen3\.6\-27B, show large gains in *B* \(up to \+13 pp\) together with substantial gains in *W*\. For the first three, *W* rises by up to \+14 pp; Qwen3\.6\-27B attains \+12\.1 pp in *B* but only \+8\.0 pp in *W*\. This suggests that partial execution repair can unlock correct terminal commitment when closure behavior is already state\-coupled\. We operationalize state\-coupled closure as follows: under the baseline no\-feedback contract, Gemini, Doubao, Qwen3\.6, and Qwen3\.5 combine moderate Δ \(1\.3–8\.9 pp\) with aggregate FR far below the GPT\-5\.4 and Claude profiles \(Gemini 24\.6%, Doubao 37\.2%, Qwen3\.6 18\.4%, Qwen3\.5 25\.5%; Figure [3](https://arxiv.org/html/2605.08747#S4.F3)\)\. Their dominant bottleneck is reaching the goal state; feedback removes execution traps \(reducing path\_blocked by 40–60%\), enabling more episodes to reach goal satisfaction, and because closure is already state\-coupled, the additional achieved states are promptly converted into correct reports\. Qwen3\.6\-27B is informative: its ΔB \(\+12\.1 pp\) exceeds ΔW \(\+8\.0 pp\), so the gap Δ = *W* − *B* shrinks by 4\.1 pp, indicating that improved execution disproportionately resolves missed closures over newly reached goals under this pairing\.
#### NR\-heavy and FR\-heavy closure failures persist under partial execution repair\.
The remaining six models show a dissociation between execution improvement and terminal commitment, through three patterns\. *FR\-heavy persistence*: GPT\-5\.4 gains \+4\.5 pp in *W* but only \+2\.3 pp in *B*; its FR barely moves \(ΔFR = −3\.6 pp, remaining above 50%\)\. *FR→NR conversion*: Claude\-Sonnet\-4 is the most extreme case\. Feedback increases its mean steps from 6\.8 to 10\.8 \(paired logs in Appendix [J](https://arxiv.org/html/2605.08747#A10)\) and FR drops by 23 pp, but NR rises by 8\.5 pp, consistent with premature commitments being suppressed without being replaced by correct reports, yielding only \+0\.9 pp in *B*\. *NR\-heavy persistence*: InternVL3\.5\-38B reduces path\_blocked from 11\.94 to 7\.38 per episode yet gains only \+1\.5 pp in *B*; its NR is essentially unchanged \(ΔNR = \+0\.7 pp\), indicating that the closure bottleneck is not fully explained by navigation difficulty\. MiMo\-Embodied\-7B and the two Qwen3\-VL\-32B variants follow the same pattern: all gain modestly in *W* \(\+2\.6 to \+4\.0 pp\) with minimal or zero *B* improvement, and FR either holds steady or *increases*, suggesting that improved reachability can expose premature commitments rather than automatically produce correct reports \(full deltas in Table [3](https://arxiv.org/html/2605.08747#S4.T3)\)\. These results show that improving execution is sometimes necessary, but not sufficient, for terminal commitment; the two dimensions require separate diagnosis and intervention\.
## 5Conclusion and Limitations
Vigil makes terminal commitment independently measurable by separating world\-state completion from report correctness under a strict first\-person, no\-feedback contract\. Across 20 models on 1,000 frozen episodes, execution and terminal commitment separate systematically: models with comparable world\-state completion differ by up to 19\.7 pp in benchmark success, and an action\-feedback intervention modeled on proprioceptive signals improves *W* broadly but leaves closure failures intact for models whose terminal behavior is not state\-coupled\. These findings indicate that today’s embodied models can achieve task\-relevant world states yet fail to convert them into correct terminal reports, a failure that standard embodied evaluations do not make visible\.
#### Limitations\.
All experiments run in AI2\-THOR \[[30](https://arxiv.org/html/2605.08747#bib.bib30)\] with houses generated by ProcTHOR \[[31](https://arxiv.org/html/2605.08747#bib.bib31)\] under a single first\-person, no\-feedback contract; generalization to photorealistic simulators, physical robots, or alternative control interfaces has not been established\. The mandatory report protocol is itself a measurement instrument: it may elicit closure failures that a stop\-only interface would leave latent, so results characterize behavior under this contract rather than model\-optimal performance\.
#### Broader impacts\.
Making premature and missed terminal commitments observable may help improve the reliability of embodied agents in settings where acting without confirming task completion can be costly\. At the same time, progress on semantic closure under a fixed reporting contract should not be mistaken for general embodied competence or calibrated self\-knowledge outside that contract\.
## References
- Zhang et al\. \[2026\] Pingyue Zhang, Zihan Huang, Yue Wang, Jieyu Zhang, Letian Xue, Zihan Wang, Qineng Wang, Keshigeyan Chandrasegaran, Ruohan Zhang, Yejin Choi, et al\. Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration? In *ICLR*, 2026\. See also arXiv preprint arXiv:2602\.07055\.
- Gao et al\. \[2026\] Huan\-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, and Mengdi Wang\. CubeBench: Diagnosing Interactive, Long\-Horizon Spatial Reasoning under Partial Observations\. In *ICLR*, 2026\.
- Zitkovich et al\. \[2023\] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, Quan Vuong, Vincent Vanhoucke, Huong Tran, Radu Soricut, Anikait Singh, Jaspiar Singh, Pierre Sermanet, Pannag R\. Sanketi, Grecia Salazar, Michael S\. Ryoo, et al\. RT\-2: Vision\-Language\-Action Models Transfer Web Knowledge to Robotic Control\. In *CoRL*, 2023\.
- Yang et al\. \[2025a\] Rui Yang, Hanyang Chen, Junyu Zhang, et al\. EmbodiedBench: Comprehensive Benchmarking Multi\-modal Large Language Models for Vision\-Driven Embodied Agents\. In *ICML*, 2025a\. See also arXiv preprint arXiv:2502\.09560\.
- Shridhar et al\. \[2020\] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox\. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks\. In *CVPR*, 2020\.
- Choi et al\. \[2024\] Jae\-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, and Minsu Jang\. LoTa\-Bench: Benchmarking Language\-oriented Task Planners for Embodied Agents\. In *ICLR*, 2024\.
- Khanna et al\. \[2024\] Mukul Khanna, Ram Ramrakhya, Gunjan Chhablani, Sriram Yenamandra, Theophile Gervet, Matthew Chang, Zsolt Kira, Devendra Singh Chaplot, Dhruv Batra, and Roozbeh Mottaghi\. GOAT\-Bench: A Benchmark for Multi\-Modal Lifelong Navigation\. In *CVPR*, 2024\.
- Li et al\. \[2024\] Manling Li, Shiyu Zhao, Qineng Wang, Kangrui Wang, Yu Zhou, Sanjana Srivastava, Cem Gokmen, Tony Lee, Li Erran Li, Ruohan Zhang, et al\. Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making\. In *NeurIPS*, 2024\.
- Cheng et al\. \[2025\] Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, et al\. EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents\. *arXiv preprint arXiv:2501\.11858*, 2025\.
- Peng et al\. \[2026\] Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, and Tong Xu\. How Foundational Skills Influence VLM\-based Embodied Agents: A Native Perspective\. In *AAAI*, 2026\. See also arXiv preprint arXiv:2602\.20687\.
- Kadavath et al\. \[2022\] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield\-Dodds, Nova DasSarma, Eli Tran\-Vu, et al\. Language Models \(Mostly\) Know What They Know\. *arXiv preprint arXiv:2207\.05221*, 2022\.
- Ren et al\. \[2023\] Allen Z\. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar\. Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners\. In *CoRL*, 2023\.
- Huang et al\. \[2022\] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter\. Inner Monologue: Embodied Reasoning through Planning with Language Models\. In *CoRL*, 2022\.
- Shridhar et al\. \[2021\] Mohit Shridhar, Xingdi Yuan, Marc\-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht\. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning\. In *ICLR*, 2021\.
- Zheng et al\. \[2022\] Kaizhi Zheng, Xiaotong Chen, Odest Jenkins, and Xin Eric Wang\. VLMbench: A Compositional Benchmark for Vision\-and\-Language Manipulation\. In *NeurIPS*, 2022\.
- Li et al\. \[2023\] Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martin\-Martin, Chen Wang, Gabrael Levine, Michael Lingelbach, Jiankai Sun, et al\. BEHAVIOR\-1K: A Benchmark for Embodied AI with 1,000 Everyday Activities and Realistic Simulation\. In *CoRL*, 2023\.
- Padmakumar et al\. \[2022\] Aishwarya Padmakumar, Jesse Thomason, Ayush Shrivastava, Patrick Lange, Anjali Narayan\-Chen, Spandana Gella, Robinson Piramuthu, Gokhan Tur, and Dilek Hakkani\-Tur\. TEACh: Task\-Driven Embodied Agents That Chat\. In *AAAI*, 2022\.
- Kim et al\. \[2024\] Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, and Jonghyun Choi\. ReALFRED: An Embodied Instruction Following Benchmark in Photo\-Realistic Environments\. In *ECCV*, 2024\.
- Chen et al\. \[2024\] Boyuan Chen, Zhuo Xu, Sean Kirmani, Danny Driess, Pete Florence, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia\. SpatialVLM: Endowing Vision\-Language Models with Spatial Reasoning Capabilities\. In *CVPR*, 2024\.
- Cheng et al\. \[2024\] An\-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu\. SpatialRGPT: Grounded Spatial Reasoning in Vision\-Language Models\. In *NeurIPS*, 2024\.
- Jia et al\. \[2026\] Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, and Li Yi\. OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models\. In *ICLR*, 2026\. See also arXiv preprint arXiv:2506\.03135\.
- Yang et al\. \[2025b\] Sihan Yang, Runsen Xu, Yiman Xie, Sizhe Yang, Mo Li, Jingli Lin, Chenming Zhu, Xiaochen Chen, Haodong Duan, Xiangyu Yue, Dahua Lin, Tai Wang, and Jiangmiao Pang\. MMSI\-Bench: A Benchmark for Multi\-Image Spatial Intelligence\. In *ICLR*, 2025b\.
- Wang et al\. \[2025a\] Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille\. Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models\. In *CVPR*, 2025a\.
- Yuan et al\. \[2024\] Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox\. RoboPoint: A Vision\-Language Model for Spatial Affordance Prediction for Robotics\. In *CoRL*, 2024\.
- Song et al\. \[2025\] Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield\. RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision\-Language Models for Robotics\. In *CVPR*, 2025\.
- Yang et al\. \[2025c\] Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei\-Fei, and Saining Xie\. Cambrian\-S: Towards Spatial Supersensing in Video\. *arXiv preprint arXiv:2511\.04670*, 2025c\.
- Zhou et al\. \[2025\] Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Eric Xin Wang, and Achuta Kadambi\. VLM4D: Towards Spatiotemporal Awareness in Vision Language Models\. In *ICCV*, 2025\.
- Yang et al\. \[2024\] Jihan Yang, Shusheng Yang, Anjali W\. Gupta, Rilyn Han, Li Fei\-Fei, and Saining Xie\. Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces\. In *CVPR*, 2024\.
- Wang et al\. \[2025b\] Qineng Wang, Baiqiao Yin, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, and Li Fei\-Fei\. Spatial Mental Modeling from Limited Views\. In *ICLR*, 2025b\.
- Kolve et al\. \[2017\] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi\. AI2\-THOR: An Interactive 3D Environment for Visual AI\. *arXiv preprint arXiv:1712\.05474*, 2017\.
- Deitke et al\. \[2022\] Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi\. ProcTHOR: Large\-Scale Embodied AI Using Procedural Generation\. In *NeurIPS*, 2022\.
- Google DeepMind \[2026\] Google DeepMind\. Gemini 3\.1 Pro model card\. [https://deepmind\.google/models/model\-cards/gemini\-3\-1\-pro/](https://deepmind.google/models/model-cards/gemini-3-1-pro/), 2026\.
- Guo et al\. \[2025\] Dong Guo, Faming Wu, Feida Zhu, et al\. Seed1\.5\-VL Technical Report\. *arXiv preprint arXiv:2505\.07062*, 2025\.
- OpenAI \[2026\] OpenAI\. Introducing GPT\-5\.4\. [https://openai\.com/index/introducing\-gpt\-5\-4/](https://openai.com/index/introducing-gpt-5-4/), 2026\.
- Anthropic \[2025\] Anthropic\. Introducing Claude 4\. [https://www\.anthropic\.com/news/claude\-4](https://www.anthropic.com/news/claude-4), May 2025\.
- Qwen Team \[2026a\] Qwen Team\. Qwen3\.6\-27B: Flagship\-level coding in a 27B dense model\. [https://qwen\.ai/blog?id=qwen3\.6\-27b](https://qwen.ai/blog?id=qwen3.6-27b), April 2026a\.
- Qwen Team \[2026b\] Qwen Team\. Qwen3\.5: Towards native multimodal agents\. [https://qwen\.ai/blog?id=qwen3\.5](https://qwen.ai/blog?id=qwen3.5), February 2026b\.
- Bai et al\. \[2025\] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al\. Qwen3\-VL Technical Report\. *arXiv preprint arXiv:2511\.21631*, 2025\.
- Wang et al\. \[2025c\] Weiyun Wang, Zhangwei Gao, Lixin Gu, Zhe Chen, et al\. InternVL3\.5: Advancing open\-source multimodal models in versatility, reasoning, and efficiency\. *arXiv preprint arXiv:2508\.18265*, 2025c\.
- Hao et al\. \[2025\] Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, et al\. MiMo\-Embodied: X\-Embodied Foundation Model Technical Report\. *arXiv preprint arXiv:2511\.16518*, 2025\.
- Kimi Team et al\. \[2025\] Kimi Team, Angang Du, et al\. Kimi\-VL Technical Report\. *arXiv preprint arXiv:2504\.07491*, 2025\.
- Ji et al\. \[2025\] Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, et al\. RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete\. *arXiv preprint arXiv:2502\.21257*, 2025\.
- Dang et al\. \[2026\] Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangping Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, Minghao Zhu, Xiao Lin, Yang Bai, Qian Jiang, Yaxi Zhao, Minghua Zeng, Junlong Gao, Yuming Jiang, Jun Cen, Siteng Huang, Liuyi Wang, Wenqiao Zhang, Chengju Liu, Jianfei Yang, Shijian Lu, and Deli Zhao\. RynnBrain: Open Embodied Foundation Models\. *arXiv preprint arXiv:2602\.14979*, 2026\.
- Yang et al\. \[2025d\] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, et al\. Qwen3 Technical Report\. *arXiv preprint arXiv:2505\.09388*, 2025d\.
## Appendix ASystem Prompt Specification
The system prompt is assembled programmatically from four blocks in fixed order\. No privileged simulator state crosses the agent–evaluator boundary\. We reproduce each block verbatim below; each benchmark run stores a SHA\-256 hash of the rendered prompt alongside the profile name and prompt\-policy version, enabling bit\-exact reproducibility audits\.
Block 1: Task

You are an embodied agent in a simulated 3D environment with an egocentric \(first\-person\) view\. Each step you only see local observations; plan and act under partial observability\. Ground decisions in the task, the current observation, recent observation history when available, and the allowed actions schema\. Actions are embodied attempts in the scene, not symbolic shortcuts\. Use only public evidence from this episode\. Do not assume hidden simulator state or hidden success signals\. For state\-recognition tasks, report the observed state label \(for example: on/off or open/closed\) rather than mapping the state to task success/fail\. Mere visibility, proximity, or an attempted action is not enough; do not report success from visual plausibility alone\. Do not claim completion from visual plausibility alone when the task requires changing world state\. Task: <TASK\_INSTRUCTION\>

Block 2: Action Schema

\#\# Allowed actions

\- navigate\|look\|interact\_pixel\|report

Each skill is serialized from the action registry with its required/optional arguments, type constraints, and usage notes\. The full specification matches Tables [4](https://arxiv.org/html/2605.08747#A2.T4)–[6](https://arxiv.org/html/2605.08747#A2.T6) \(Appendix [B](https://arxiv.org/html/2605.08747#A2)\) verbatim\.

Block 3: Grounding Rules

\#\# Grounding Rules

\- Use only the frame whose ‘role‘ is ‘current‘ for visibility judgments and interact\_pixel targeting\.

\- interact\_pixel acts on the object currently under the specified location in the current RGB frame\.

\- Do not assume hidden success state; rely on subsequent public observations and allowed feedback fields from this episode\.

For models using normalized coordinates, an additional rule is prepended:

For interact\_pixel, output x and y in normalized\_1000 coordinates: integers in \[0, 1000\], where \(0,0\) is the top\-left and \(1000,1000\) is the bottom\-right corner\.

Block 4: Output Format

\#\# Output format

Return exactly one JSON object and nothing else\. Required keys: ‘skill\_name‘, ‘arguments‘\. Optional keys: ‘thought‘, ‘cognitive\_state‘\. Use the exact action names from the allowed\-actions schema above; do not invent aliases or synonyms\. Do not output prose, markdown, code fences, or multiple JSON objects\. If you include ‘thought‘, keep it to one short sentence\.
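The reproducibility audit described at the start of this appendix (a SHA-256 hash stored alongside each rendered prompt) can be sketched as follows. The block separator and the function names are assumptions; only the fixed block order and the use of SHA-256 come from the text.

```python
import hashlib
from typing import List


def render_prompt(blocks: List[str]) -> str:
    """Join the four prompt blocks in their fixed order.

    The blank-line separator is an assumption; the paper does not specify it.
    """
    return "\n\n".join(blocks)


def prompt_fingerprint(rendered: str) -> str:
    """Hex SHA-256 digest of the rendered prompt, suitable for storing per run."""
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()
```

Because the digest covers the full rendered string, any change to block content or block order yields a different fingerprint, which is what makes a bit-exact audit possible.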
## Appendix BAction Space and Termination Contract
### B\.1Skill Definitions
The action space consists of four skills: two for locomotion \(navigate, look\), one for object interaction \(interact\_pixel\), and one for terminal self\-report \(report\)\. Table [4](https://arxiv.org/html/2605.08747#A2.T4) provides the complete specification\.
Table 4: Complete action\-space specification\. R denotes required; C denotes conditionally required \(required unless intent is drop\)\.

| Skill | Argument | Type | Description | Req\. |
|---|---|---|---|---|
| navigate | mode | enum | forward, backward, turn\_left, turn\_right | R |
| | magnitude | number | Step count \(fwd/bwd\) or rotation in degrees \(turns\) | R |
| look | direction | enum | up, down | R |
| | magnitude | number | Pitch change in degrees | R |
| interact\_pixel | intent | enum | One of eight canonical intents \(Table [5](https://arxiv.org/html/2605.08747#A2.T5)\) | R |
| | x | int | Horizontal pixel coordinate in the current RGB frame | C |
| | y | int | Vertical pixel coordinate in the current RGB frame | C |
| report | status | enum | Terminal status \(Table [6](https://arxiv.org/html/2605.08747#A2.T6)\) | R |
| | summary | string | Brief justification for the chosen status | R |
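To make the schema concrete, the sketch below checks a raw model response against the argument requirements in Table 4 and the single-JSON-object rule from the output-format block. It is an illustrative validator, not the benchmark's parser: `validate_action`, `SCHEMA`, and `ENUMS` are hypothetical names, and the intent and status enums are left unchecked because their values are listed in Tables 5 and 6.

```python
import json
from typing import List

# Argument names and types transcribed from Table 4.
SCHEMA = {
    "navigate": {"mode": str, "magnitude": (int, float)},
    "look": {"direction": str, "magnitude": (int, float)},
    "interact_pixel": {"intent": str, "x": int, "y": int},
    "report": {"status": str, "summary": str},
}
ENUMS = {
    ("navigate", "mode"): {"forward", "backward", "turn_left", "turn_right"},
    ("look", "direction"): {"up", "down"},
}


def validate_action(raw: str) -> List[str]:
    """Return a list of problems; an empty list means the action is well-formed."""
    try:
        obj = json.loads(raw)  # rejects prose and multiple JSON objects
    except json.JSONDecodeError:
        return ["not a single JSON object"]
    if not isinstance(obj, dict):
        return ["not a single JSON object"]
    skill, args = obj.get("skill_name"), obj.get("arguments")
    if skill not in SCHEMA or not isinstance(args, dict):
        return ["unknown skill_name or missing arguments"]
    problems = []
    for name, typ in SCHEMA[skill].items():
        # x and y are conditionally required: optional only when intent is "drop".
        optional = (skill == "interact_pixel" and name in ("x", "y")
                    and args.get("intent") == "drop")
        if name not in args:
            if not optional:
                problems.append(f"missing {name}")
            continue
        value = args[name]
        if isinstance(value, bool) or not isinstance(value, typ):
            problems.append(f"bad type for {name}")
        elif (skill, name) in ENUMS and value not in ENUMS[(skill, name)]:
            problems.append(f"bad value for {name}")
    return problems
```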
### B\.2Canonical Interaction Intents
Table [5](https://arxiv.org/html/2605.08747#A2.T5) lists the eight canonical intents accepted by interact\_pixel\. Common aliases \(e\.g\., open → open\_access, toggle\_on → activate\) are normalized at runtime; models may emit either form\.
Table 5: Canonical intents for interact\_pixel\.
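Runtime alias normalization of this kind is typically a dictionary lookup with pass-through. The sketch below includes only the two aliases named in the text; the full alias table is not reproduced here, and the whitespace and case handling is an assumption.

```python
# Only the two aliases given as examples in the text; the real table may be larger.
INTENT_ALIASES = {
    "open": "open_access",
    "toggle_on": "activate",
}


def normalize_intent(intent: str) -> str:
    """Map a known alias to its canonical intent; pass other values through unchanged.

    Stripping whitespace and lowercasing are assumptions, not documented behavior.
    """
    key = intent.strip().lower()
    return INTENT_ALIASES.get(key, key)
```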
### B\.3Report Statuses
Table 6: Terminal report statuses accepted by report\. In the current frozen pack, unsafe and invalid are interface\-level fallback statuses, not task\-specific target labels\.
### B\.4Step Budgets and Termination
Each task family enforces a fixed step budget and an invalid\-action limit \(see Table [8](https://arxiv.org/html/2605.08747#A4.T8) in Appendix [D](https://arxiv.org/html/2605.08747#A4) for per\-family values\)\. Exceeding the step budget terminates the episode as fail\_no\_report; exceeding the invalid\-action limit terminates as invalid\_action\_limit\_exceeded\.
### B\.5Termination Contract
The report skill is the *sole* agent\-initiated termination mechanism\. The agent must invoke it with a status and justification to end the episode; the benchmark never issues a report automatically\. Three terminal conditions exist:
1. Agent report: the agent invokes report; the status is matched against hidden world state\.
2. Budget exhaustion: the step budget is reached without a report; scored as fail\_no\_report\.
3. Invalid\-action limit: cumulative invalid actions \(protocol failures \+ malformed actions\) exceed the family\-specific limit; scored as invalid\_action\_limit\_exceeded\.
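The three terminal conditions above can be expressed as a single check run after every step. This is a sketch under stated assumptions: the function and argument names are hypothetical, the budget is treated as reached when the step count equals it, and the precedence between the budget and invalid-action checks is our choice, since the text does not specify what happens when both trigger on the same step.

```python
from typing import Optional


def terminal_condition(steps_taken: int, invalid_actions: int, reported: bool,
                       step_budget: int, invalid_limit: int) -> Optional[str]:
    """Return the terminal condition reached after the latest step, or None to continue."""
    if reported:
        # Condition 1: the agent ended the episode; its status is then
        # matched against hidden world state by the evaluator.
        return "agent_report"
    if invalid_actions > invalid_limit:
        # Condition 3: cumulative invalid actions exceed the family limit.
        return "invalid_action_limit_exceeded"
    if steps_taken >= step_budget:
        # Condition 2: budget reached with no report issued.
        return "fail_no_report"
    return None
```

For example, with a 5-step budget (the PG and SV value in Table 9) and an illustrative invalid-action limit of 2, an agent that navigates for five steps without reporting is settled as `fail_no_report`.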
## Appendix CScoring Details
### C\.1Dual\-Metric Evaluation
Each episode is evaluated under two success metrics simultaneously:
- Semantic \(primary\): tolerant of minor imprecision in object placement or state matching\.
- Strict: exact simulator predicate check with no tolerance\.
The main paper reports aggregate *W*, *B*, and Δ = *W* − *B* under the semantic world predicate paired with strict terminal report matching \(see §[3\.3](https://arxiv.org/html/2605.08747#S3.SS3)\)\. The strict world metric is retained as a validation variant to expose whether apparent world\-state completion would rest on evaluation leniency\.
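As a worked illustration of the aggregate scores, the sketch below computes W, B, and Δ from per-episode flags. It assumes, following the abstract, that benchmark success requires both world-state completion and a correct terminal report; the function name and the tuple layout are hypothetical.

```python
from typing import Iterable, Tuple


def aggregate_scores(episodes: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Compute (W, B, Delta) from (world_ok, report_match) pairs.

    W and B are percentages of all episodes; Delta = W - B is in percentage points.
    """
    eps = list(episodes)
    if not eps:
        raise ValueError("no episodes")
    n = len(eps)
    w = 100.0 * sum(1 for world_ok, _ in eps if world_ok) / n
    # B counts only episodes where the world goal holds AND the report matches.
    b = 100.0 * sum(1 for world_ok, match in eps if world_ok and match) / n
    return w, b, w - b
```

With four episodes covering every combination of the two flags, this gives W = 50, B = 25, and Δ = 25 pp: one of the two achieved states was not converted into a correct report.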
### C\.2Reported Closure\-Failure Labels
The main analysis reports three closure\-failure labels, summarized in Table [7](https://arxiv.org/html/2605.08747#A3.T7)\. They are computed post\-hoc from the episode settlement record and are used consistently in Figure [3](https://arxiv.org/html/2605.08747#S4.F3), Table [3](https://arxiv.org/html/2605.08747#S4.T3), and the extended model panels\.
Table 7: Reported closure\-failure labels\. These are the diagnostic labels reported in the main figures and tables\. FR and NR separate incorrect terminal content from missing terminal commitment; IL captures forced termination due to repeated invalid actions\.
This decomposition is intended to separate protocol\-compliance failures from task\-state\-judgment failures: IL primarily captures malformed\-action and interface\-alignment breakdowns, while FR/NR characterize closure behavior on trajectories that remain protocol\-valid until settlement\.
### C\.3Deterministic Report\-Matching Rules
The function $\texttt{match}(\texttt{report}_T, s_T)$ referenced in §[3\.3](https://arxiv.org/html/2605.08747#S3.SS3) operates in two modes, determined by the episode’s success\-condition type\. No family\-specific branching exists in the match logic itself; family\-specific behavior arises only from how the world\-success predicate $\mathcal{G}_{\mathrm{sem}}$ is evaluated\.
#### Goal\-completion mode \(PG, DA, VS, AI, SI, SM, CR\)\.
The episode success condition checks a world\-state predicate \(e\.g\., agent\_near\_target, object\_state, object\_held\)\. The report status is first normalized to one of eight canonical values \(success, fail, unsafe, invalid, on, off, open, closed\); any other value is treated as invalid\. The match rule is:

$$\texttt{match} = \bigl(\texttt{status} = \texttt{success} \;\land\; W = 1\bigr) \;\lor\; \bigl(\texttt{status} \in \{\texttt{fail}, \texttt{unsafe}, \texttt{invalid}\} \;\land\; W = 0\bigr).$$

A correct fail report when *W* = 0 therefore counts as a match \(“honest fail”\), and a success report when *W* = 0 is a false\-success report\. Under this goal\-completion rule, unsafe and invalid are handled as negative reports in the same matching set as fail\.
#### State\-verification mode \(SV\)\.
SV episodes use a report\_status success condition with an explicit expected label derived from the hidden simulator state via:

$$\texttt{is\_toggled} \mapsto \texttt{on}/\texttt{off}, \qquad \texttt{is\_open} \mapsto \texttt{open}/\texttt{closed}.$$

For benchmark success, the target state must be publicly observable at closure \(*W* = 1\) and the report must equal this expected categorical label\. If the state is not publicly observable at closure, the episode is not a benchmark success regardless of report content\. The match rule is exact string equality between the normalized report status and the expected label:

$$\texttt{match} = \bigl(\texttt{status} = \texttt{expected\_label}\bigr).$$

A stop\-only protocol cannot represent this judgment because the correct terminal output is categorical, not binary\.
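Both matching modes are small enough to state directly in code. The sketch below follows the rules in this section; the function names are hypothetical, the whitespace and case handling in `normalize_status` is an assumption, and mapping unrecognized values to the `invalid` status is one literal reading of "treated as invalid".

```python
CANONICAL = {"success", "fail", "unsafe", "invalid", "on", "off", "open", "closed"}
NEGATIVE = {"fail", "unsafe", "invalid"}


def normalize_status(status: str) -> str:
    """Normalize to one of the eight canonical values; anything else becomes 'invalid'."""
    s = status.strip().lower()  # assumption: normalization is case/whitespace-insensitive
    return s if s in CANONICAL else "invalid"


def match_goal_completion(status: str, world_ok: bool) -> bool:
    """Goal-completion mode: success needs W=1; negative reports match only when W=0."""
    s = normalize_status(status)
    return bool((s == "success" and world_ok) or (s in NEGATIVE and not world_ok))


def expected_label(prop: str, value: bool) -> str:
    """Map a hidden simulator property to its categorical label (SV mode)."""
    labels = {"is_toggled": ("on", "off"), "is_open": ("open", "closed")}
    true_label, false_label = labels[prop]
    return true_label if value else false_label


def match_state_verification(status: str, expected: str) -> bool:
    """State-verification mode: exact equality with the expected label."""
    return normalize_status(status) == expected
```

Note that a match alone is not benchmark success: an honest `fail` at W = 0 matches, but B additionally requires W = 1.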
## Appendix DTask Family Specifications
The eight task families are organized into a diagnostic tier \(PG, DA, VS, SV\) that isolates atomic competencies and a compositional tier \(AI, SI, SM, CR\) that chains them under increasing partial\-observability demand\. Table [8](https://arxiv.org/html/2605.08747#A4.T8) summarizes the key parameters; detailed descriptions follow\.
Table 8: Task family specification summary\. “Vis\.” indicates whether the target is visible at episode start\. “Steps” and “Inv\.” are the episode step budget and invalid\-action limit\. “Skills” lists the expected skill repertoire; exact initial poses and distances vary by authored episode and are simulator\-validated during pack construction\.
#### PG: Pixel Grounding\.
The target object is visible and reachable at episode start; the agent must click \(ground\) it via interact\_pixel\(ground\), then submit a report\. This isolates visual grounding, the ability to map a natural\-language reference to a pixel coordinate in the egocentric frame\. Success criterion: target\_grounded \(the grounding click falls on the correct object instance\)\. Difficulty varies by target object size: large \(e\.g\., fridge, television\), medium \(e\.g\., microwave, cabinet\), or small \(e\.g\., egg, pen, watch\)\.
#### DA: Distance Approach\.
The target is visible but initially outside the success threshold; the agent must navigate close enough, then report\. This isolates spatial navigation under egocentric observation, that is, coarse depth estimation and obstacle avoidance without any interaction\. Success criterion: agent\_near\_target \(Euclidean distance < 1\.5 m\)\.
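The DA success criterion reduces to a strict distance threshold. In the sketch below the function signature is hypothetical, and whether the benchmark measures distance in 3D or only in the horizontal plane is not stated, so the dimensionality simply follows the coordinates passed in.

```python
import math
from typing import Sequence


def agent_near_target(agent_pos: Sequence[float], target_pos: Sequence[float],
                      threshold_m: float = 1.5) -> bool:
    """True when the Euclidean distance is strictly below the threshold (in meters)."""
    return math.dist(agent_pos, target_pos) < threshold_m
```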
#### VS: View Search\.
The target is *not visible* at episode start\. The agent must change viewpoint through navigate and/or look actions until the target becomes visible, then report\. This isolates active search under partial observability, the ability to systematically explore the environment\. Success criterion: object\_state\(visible=True\), latched once achieved\.
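"Latched" here means the criterion stays satisfied once the target has been visible at any step, even if the agent later looks away. A minimal sketch, with a hypothetical function name and a per-step visibility list as input:

```python
from typing import List, Sequence


def latch(visible_per_step: Sequence[bool]) -> List[bool]:
    """Running OR over per-step visibility: once True, the flag never resets."""
    latched = False
    out = []
    for visible in visible_per_step:
        latched = latched or visible
        out.append(latched)
    return out
```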
#### SV: State Verification\.
The agent starts with a visible object whose state \(on/off or open/closed\) must be identified and reported\. This isolates perceptual state recognition, the ability to observe and classify the current state of an object\. Success criterion: report\_status matching the oracle state label\.
#### AI: Approach and Interact\.
The target is visible at episode start\. The agent must approach as needed, perform a single pixel\-grounded interaction \(activate, deactivate, open, close, or pick\), then report\. This composes DA \(approach\) and PG\-level grounding with a state\-changing interaction\. Success criterion: object\_state or object\_held, depending on intent\.
#### SI: Search and Interact\.
The target is not visible at episode start\. The agent must find the target, approach as needed, perform the required pixel\-grounded interaction, and report\. This is the full perception–navigation–interaction pipeline under maximal partial observability\. Success criterion: same as AI, conditioned on prior search\.
#### SM: Sequential Manipulation\.
The instruction requires an ordered manipulation chain, such as opening a container, picking an object, and placing it at a destination\. Four functional templates exist: reveal\_pick \(open container, pick hidden object\), put\_into \(place object in receptacle\), rearrange \(move object between receptacles\), and open\_pick\_place \(open, pick, place elsewhere\)\. Success criterion: object\_held or object\_at\_receptacle\.
#### CR: Constraint Resolving\.
The target is visible at episode start, but a physical constraint blocks the direct completion path\. The agent must resolve the constraint \(e\.g\., move or interact with an obstacle\), navigate as needed, perform the required interaction, and report\. This tests planning under physical constraints—the agent must reason about path feasibility before completing the interaction\. Success criterion: same as AI, with the additional requirement that the constraint is resolved\.
## Appendix EBenchmark Episode Details
### E\.1Episode Generation Pipeline
Episodes are constructed through a three\-stage pipeline:
1. LLM proposal: a language model drafts candidate tasks conditioned on scene inventories and family\-specific constraints \(target visibility, start\-pose requirements, available intents, object categories\)\.
2. Simulator validation: each proposal is instantiated in AI2\-THOR and validated for object existence, state accessibility, agent reachability, and success\-condition solvability\. The validation engine checks episode\-contract integrity including agent initialization, scene setup consistency, success\-spec type validity, and family\-specific rules \(e\.g\., SM requires pre\-conditions and multi\-object bindings; CR requires blocked\-path preconditions\)\.
3. Human review: a human auditor reviews a stratified sample for ambiguity, instruction quality, and difficulty calibration\. Manually approved episodes receive priority in pack assembly\.
### E\.2 Pack Composition
The evaluation uses pack `mixed_mainline_manual_balanced_1000`, containing 1,000 episodes with exactly 125 per task family\. These episodes are selected from a larger authored pool spanning diverse AI2\-THOR environments, including ProcTHOR\-generated scenes\.

Table 9: Episode distribution in the 1,000\-episode evaluation pack\.

| Family | Episodes | Max Steps |
|---|---|---|
| PG Pixel Grounding | 125 | 5 |
| DA Distance Approach | 125 | 12 |
| VS View Search | 125 | 20 |
| SV State Verification | 125 | 5 |
| AI Approach and Interact | 125 | 25 |
| SI Search and Interact | 125 | 35 |
| SM Sequential Manipulation | 125 | 30 |
| CR Constraint Resolving | 125 | 40 |
| Total | 1,000 | |

### E\.3 Simulator Configuration
All episodes run in AI2\-THOR \(ProcTHOR scenes\) with fixed rendering parameters: resolution $640\times 480$, field of view $90^{\circ}$, visibility distance 6\.0 m, interaction distance limit 1\.5 m, and instance segmentation enabled for grounding evaluation\. Segmentation is used only by the evaluator for offline grounding and settlement; it is never exposed to the agent, which receives only RGB frames under the default contract\. The agent’s initial position, rotation, and camera horizon are specified per episode and validated against the scene setup\.
## Appendix F Primary 20\-Model Comparison Panel
Table [10](https://arxiv.org/html/2605.08747#A6.T10) reports per\-family W/B for the primary 20\-model panel used for cross\-model comparison\. The panel includes models that reliably operate the native\-control interface and are not dominated by invalid\-action\-limit failures or zero benchmark success; additional low\-compliance, zero\-B, and supplementary size/serving\-mode variants are reported separately in Table [11](https://arxiv.org/html/2605.08747#A7.T11)\. The main text \(Table [2](https://arxiv.org/html/2605.08747#S4.T2)\) uses 10 anchor models; this table adds 10 additional open\-weight models spanning MoE variants, smaller checkpoints, and embodied\-tuned systems\. The full panel includes Kimi\-VL \[[41](https://arxiv.org/html/2605.08747#bib.bib41)\], RoboBrain2\.5\-4B \[[42](https://arxiv.org/html/2605.08747#bib.bib42)\], MiMo\-Embodied\-7B \[[40](https://arxiv.org/html/2605.08747#bib.bib40)\], rynnbrain\-8B \[[43](https://arxiv.org/html/2605.08747#bib.bib43)\], the Qwen3\-VL and Qwen3\.x families \[[38](https://arxiv.org/html/2605.08747#bib.bib38),[44](https://arxiv.org/html/2605.08747#bib.bib44),[37](https://arxiv.org/html/2605.08747#bib.bib37),[36](https://arxiv.org/html/2605.08747#bib.bib36)\], InternVL3\.5 \[[39](https://arxiv.org/html/2605.08747#bib.bib39)\], and the closed\-source APIs Gemini 3\.1 Pro \[[32](https://arxiv.org/html/2605.08747#bib.bib32)\], Doubao\-Seed\-1\.8 \[[33](https://arxiv.org/html/2605.08747#bib.bib33)\], GPT\-5\.4 \[[34](https://arxiv.org/html/2605.08747#bib.bib34)\], and Claude\-Sonnet\-4 \[[35](https://arxiv.org/html/2605.08747#bib.bib35)\]\. In the 10\-anchor set, Qwen3\-VL\-32B Instruct and Qwen3\-VL\-32B\(T\) Thinking are counted as two distinct anchor models \(\(T\) = public Thinking serving mode\); Qwen3\.6\-27B and Qwen3\.5\-27B are served with thinking enabled \(one anchor row each\); every other checkpoint appears once\. Suffixes such as A3B/A17B/A22B denote *activated* expert parameters in MoE\-style checkpoints\.
Table 10: Primary 20\-model comparison panel under native control\. Each cell reports W/B \(%\)\. Diagnostic: PG = pixel grounding, DA = distance approach, VS = view search, SV = state verification\. Compositional: AI = approach\-and\-interact, SI = search\-and\-interact, SM = sequential manipulation, CR = constraint resolving\. Models are grouped by category and sorted by aggregate B within each group\. † Closed\-source API models\.
One notable anomaly: RoboBrain2\.5\-4B \[[42](https://arxiv.org/html/2605.08747#bib.bib42)\] achieves W = 94\.4% on SV \(state verification\) yet B = 0\.0%, indicating that it never uses the categorical report labels \(`on`/`off`, `open`/`closed`\) required for SV episodes and instead reports only `success`/`fail`, which the deterministic match rule cannot accept for state\-verification tasks\.
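The anomaly above follows directly from how B is defined. The sketch below is a minimal illustration, assuming B is the conjunction of world-state completion and an exact label match; the `EpisodeResult` type and its field names are hypothetical and are not taken from the Vigil codebase.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EpisodeResult:
    """Hypothetical per-episode record (field names are assumptions)."""
    family: str              # task family code, e.g. "SV" or "AI"
    world_ok: bool           # W: hidden world-state predicate is satisfied
    expected_label: str      # evaluator's correct terminal label
    report: Optional[str]    # label the agent reported, or None if no report


def benchmark_success(ep: EpisodeResult) -> bool:
    """B: world-state completion plus an exact match on the terminal label."""
    return ep.world_ok and ep.report is not None and ep.report == ep.expected_label


# An SV episode where the world condition holds but the agent answers with a
# goal-completion label instead of the categorical state label.
wrong_vocab = EpisodeResult("SV", world_ok=True, expected_label="closed", report="success")
right_vocab = EpisodeResult("SV", world_ok=True, expected_label="closed", report="closed")

print(benchmark_success(wrong_vocab))  # False: W holds, but the label cannot match
print(benchmark_success(right_vocab))  # True
```

Under this reading, a model that always answers `success` or `fail` on SV can reach high W while B stays at zero, which is the pattern reported for RoboBrain2.5-4B.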
#### Thinking vs\. Instruct serving modes\.
Several checkpoints offer both Instruct and public “Thinking” serving modes\. On the frozen evaluation pack, Qwen3\-VL\-32B\(T\) \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\] achieves higher aggregate B \(26\.4 vs\. 18\.9\) with a substantially smaller $\Delta$ \(4\.3 vs\. 20\.3 pp\) than its Instruct counterpart, shifting the profile from NR\-heavy toward a more balanced mixture\. We treat this as a behavioral observation rather than a mechanistic claim: serving mode changes the execution/reporting tradeoff under a fixed contract, but the current evidence does not identify which internal reasoning differences cause the shift\. For the remaining anchor checkpoints, we do not add further Instruct/Thinking pairing ablations beyond the Qwen3\-VL\-32B \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\] split above; each appears under one fixed deployed configuration\. Additional smaller Thinking variants appear in Table [11](https://arxiv.org/html/2605.08747#A7.T11)\.
## Appendix G Extended Model Panel
Table [10](https://arxiv.org/html/2605.08747#A6.T10) reports the primary 20\-model comparison panel\. Table [11](https://arxiv.org/html/2605.08747#A7.T11) lists eight additional models evaluated on the same frozen 1,000\-episode pack: three low\-compliance or zero\-B models moved out of the main analysis because invalid\-action\-limit \(IL\) or no\-report terminations dominate their outcomes, plus five supplementary size and serving\-mode variants\.
Table 11: Extended model panel \(models not in main tables\)\. Same evaluation contract and frozen episode pack as the main analysis\. W = world\-state completion \(%\), B = benchmark success \(%\), $\Delta = W - B$ gap \(pp\), FR = false\-report rate \(%\), NR = no\-report rate \(%\), IL = invalid\-action\-limit rate \(%\)\. Models in the first group are moved out of the main tables because low compliance, no\-report behavior, or zero benchmark success makes closure\-regime comparison less informative; models in the second group are supplementary variants evaluated under the same contract\. FR/NR/IL are diagnostic rates and are not mutually exclusive\.

| Model | B | W | $\Delta$ | FR | NR | IL | Steps |
|---|---|---|---|---|---|---|---|
| *Excluded from main tables (low-compliance or zero B)* | | | | | | | |
| Qwen3-VL-2B | 0.0 | 12.7 | 12.7 | 1.1 | 98.9 | 98.8 | 7.9 |
| Qwen3.5-2B | 6.6 | 17.0 | 10.4 | 2.3 | 90.8 | 58.1 | 13.9 |
| Kimi-VL-A3B-Instruct | 0.0 | 18.3 | 18.3 | 1.1 | 98.9 | 19.9 | 20.6 |
| *Supplementary variants* | | | | | | | |
| Qwen3.5-9B | 23.4 | 33.1 | 9.7 | 15.8 | 60.2 | 28.1 | 14.6 |
| Qwen3-VL-4B(T) | 17.9 | 24.6 | 6.7 | 59.1 | 21.5 | 12.6 | 8.6 |
| Qwen3-VL-8B(T) | 17.7 | 21.0 | 3.3 | 53.1 | 28.8 | 18.0 | 9.5 |
| Qwen3-VL-2B(T) | 3.6 | 15.2 | 11.6 | 25.4 | 71.0 | 69.3 | 8.5 |
| InternVL3.5-2B | 0.0 | 13.5 | 13.5 | 3.4 | 96.6 | 67.2 | 14.3 |

#### Observations\.
The three low\-compliance or zero\-B models \(Qwen3\-VL\-2B, Qwen3\.5\-2B, Kimi\-VL\-A3B\-Instruct\) \[[38](https://arxiv.org/html/2605.08747#bib.bib38),[37](https://arxiv.org/html/2605.08747#bib.bib37),[41](https://arxiv.org/html/2605.08747#bib.bib41)\] show that some low\-capacity or format\-fragile systems fail before the terminal\-judgment regime becomes informative: their $W$–$B$ gaps are driven mainly by invalid\-action\-limit or no\-report terminations, not by mismatched `report` decisions\. Among the supplementary variants, the smaller Thinking\-mode models \(4B\-T, 8B\-T\) show qualitatively similar failure profiles to their Instruct counterparts \(high FR with moderate IL\) under this fixed prompt and evaluation contract\. Qwen3\.5\-9B \[[37](https://arxiv.org/html/2605.08747#bib.bib37)\] is the strongest supplementary model \(B = 23\.4%, W = 33\.1%\) with a relatively high IL of 28\.1%, placing it between the main\-panel and edge\-failure regimes\.
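As a quick consistency check on Table 11, the gap column can be recomputed from the B and W columns. The snippet below is only a sketch: the values are copied from the table, while the dictionary layout and the helper name `gap_pp` are illustrative choices.

```python
# (B, W, reported gap) per model, copied from Table 11; all values in percent or pp.
TABLE_11 = {
    "Qwen3-VL-2B": (0.0, 12.7, 12.7),
    "Qwen3.5-2B": (6.6, 17.0, 10.4),
    "Kimi-VL-A3B-Instruct": (0.0, 18.3, 18.3),
    "Qwen3.5-9B": (23.4, 33.1, 9.7),
    "Qwen3-VL-4B(T)": (17.9, 24.6, 6.7),
    "Qwen3-VL-8B(T)": (17.7, 21.0, 3.3),
    "Qwen3-VL-2B(T)": (3.6, 15.2, 11.6),
    "InternVL3.5-2B": (0.0, 13.5, 13.5),
}


def gap_pp(b: float, w: float) -> float:
    """Closure gap in percentage points: world completion minus benchmark success."""
    return round(w - b, 1)


# Collect any row whose recomputed gap disagrees with the reported one.
mismatches = {
    name: (gap_pp(b, w), reported)
    for name, (b, w, reported) in TABLE_11.items()
    if abs(gap_pp(b, w) - reported) > 0.05
}
print(mismatches)  # an empty dict means every row is internally consistent
```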
## Appendix H Report\-Policy Baseline Details
Table [12](https://arxiv.org/html/2605.08747#A8.T12) reports counterfactual benchmark success when only the *content* of the terminal report is replaced on fixed trajectories, leaving execution, stop timing, and the final world state unchanged\. This baseline is not a model\-visible oracle experiment: hidden state is used only offline to rescore what would have happened under a different terminal report\.
Table 12: Counterfactual report policies \(10 anchor models\)\. B = actual benchmark success \(%\)\. Always = B when the terminal report is replaced with always\-success\. Rand\. = B under uniform\-random report\. Bold Always entries: Always \> B \(NR\-heavy models where forced closure recovers missed reports\)\.
#### Policies\.
Actual is the observed benchmark success rate\. Always\-success appends or substitutes `report(status=success)` at the final state and recomputes whether that report matches the expected terminal label\. For trajectories that originally ended by no\-report exhaustion, this should be read as a forced final report at the exhausted state, not as evidence that the model would have chosen the correct stopping time\. This baseline can succeed on goal\-completion tasks only when the final world predicate is satisfied, but it fails state\-verification episodes whose correct terminal labels are `open`/`closed` or `on`/`off`\. Random\-report reports the chance\-level expectation under a uniform draw over the two admissible labels for each episode: \{`success`, `fail`\} for goal\-completion tasks, and \{`open`, `closed`\} or \{`on`, `off`\} for state\-verification tasks\. Equivalently, it yields $0.5W$ in expectation on each fixed final state\. Oracle\-report uses the evaluator’s correct terminal label at the same final state; therefore its B equals W\. It is a fixed\-trajectory upper\-bound sanity check rather than a model\-side report\-format baseline\.
The always\-success policy underperforms actual reports for every anchor except two NR\-heavy models \(bolded in Table [12](https://arxiv.org/html/2605.08747#A8.T12)\): Qwen3\-VL\-32B \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\] \(\+10\.0 pp\) and InternVL3\.5\-38B \[[39](https://arxiv.org/html/2605.08747#bib.bib39)\] \(\+11\.7 pp\), where forcing a terminal success report recovers otherwise missed closures\. Conversely, FR\-heavy models such as Claude\-Sonnet\-4 \[[35](https://arxiv.org/html/2605.08747#bib.bib35)\] and GPT\-5\.4 \[[34](https://arxiv.org/html/2605.08747#bib.bib34)\] already report aggressively; replacing their reports with a fixed success policy further depresses B, confirming a report\-mismatch problem rather than a missing\-report problem\. The random policy provides a lower bound: uniform random reports yield roughly half of W, as expected when report status is uncorrelated with task state\.
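The three counterfactual policies can be sketched as offline rescoring functions over fixed final states. This is an illustrative reconstruction rather than the authors' implementation: it assumes B requires both W and a matching label, and the `FinalState` type, its fields, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class FinalState:
    """Hypothetical summary of one saved trajectory's final state."""
    world_ok: bool               # W at the final state
    expected_label: str          # evaluator's correct terminal label
    admissible: Tuple[str, str]  # the two admissible labels for this episode


def always_success(s: FinalState) -> float:
    # A forced "success" report only matches when "success" is the correct label,
    # so it cannot score on open/closed or on/off state-verification episodes.
    return float(s.world_ok and s.expected_label == "success")


def random_report(s: FinalState) -> float:
    # Expected score under a uniform draw over the admissible labels.
    return (1.0 / len(s.admissible)) if s.world_ok else 0.0


def oracle_report(s: FinalState) -> float:
    # The correct label is always supplied, so B collapses to W.
    return float(s.world_ok)


def rescore(states: Sequence[FinalState], policy: Callable[[FinalState], float]) -> float:
    """Counterfactual B (%) for a policy applied to fixed final states."""
    return 100.0 * sum(policy(s) for s in states) / len(states)


GOAL = ("success", "fail")
toy_states = [
    FinalState(True, "success", GOAL),
    FinalState(False, "fail", GOAL),
    FinalState(True, "closed", ("open", "closed")),
    FinalState(True, "success", GOAL),
]
for policy in (always_success, random_report, oracle_report):
    print(policy.__name__, rescore(toy_states, policy))
```

On this toy set the oracle score equals W (75%), the random policy gives half of that, and always-success loses the state-verification episode, mirroring the qualitative pattern described above.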
#### Implementation\.
The counterfactual analysis uses the same saved 1,000\-episode native\-control traces as the corresponding rows in the main model panel\.
## Appendix I Belief Lag and Premature Commitment
Table [13](https://arxiv.org/html/2605.08747#A9.T13) provides step\-level detail on the 10 anchor models’ terminal\-commitment timing\. The main\-text analysis \(§[4\.2](https://arxiv.org/html/2605.08747#S4.SS2)\) draws on two patterns from this table: \(i\) correct reports arrive within 0\.9–1\.9 steps of first goal satisfaction, and \(ii\) 65–88% of false\-success reports are issued at zero task progress\. Task progress is a continuous $[0,1]$ scalar computed by the per\-step evaluator from the episode’s success specification: it aggregates normalized sub\-condition scores \(e\.g\., distance to target, object\-state satisfaction, grounding correctness\) at the moment of the report\. Zero progress means no sub\-condition has advanced beyond its initial value: the agent has not navigated closer, changed any task\-relevant object state, or achieved any partial goal\.
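The two timing statistics used above can be computed from per-episode logs roughly as follows. This is a sketch under stated assumptions: the record keys (`world_ok`, `report_correct`, `first_goal_step`, `report_step`) are hypothetical, and the paper does not specify how sub-condition scores are aggregated into the progress scalar, so progress values are taken as given.

```python
from statistics import mean
from typing import Optional, Sequence


def closure_lag(episodes: Sequence[dict]) -> Optional[float]:
    """Mean steps from first goal satisfaction to a correct terminal report.

    Only episodes with world-state completion and a correct report contribute.
    """
    lags = [
        ep["report_step"] - ep["first_goal_step"]
        for ep in episodes
        if ep["world_ok"] and ep["report_correct"]
    ]
    return mean(lags) if lags else None


def zero_progress_share(progress_at_report: Sequence[float]) -> float:
    """Fraction of false-success reports issued at exactly zero task progress."""
    return sum(p == 0.0 for p in progress_at_report) / len(progress_at_report)


toy_episodes = [
    {"world_ok": True, "report_correct": True, "first_goal_step": 4, "report_step": 5},
    {"world_ok": True, "report_correct": True, "first_goal_step": 2, "report_step": 4},
    {"world_ok": True, "report_correct": False, "first_goal_step": 3, "report_step": 9},
]
print(closure_lag(toy_episodes))                  # mean of the lags 1 and 2
print(zero_progress_share([0.0, 0.0, 0.3, 0.0]))  # three of four reports at zero progress
```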
Table 13: Closure lag and premature commitment \(10 anchor models\)\. W\+ = episodes with world\-state completion \(count out of 1,000\)\. Lag = mean steps from first goal satisfaction to correct terminal report \(for episodes with $W=1$ and correct report\), measuring observable completion\-to\-report delay at closure\. NR = $W=1$ episodes with no report \(count\)\. For false\-success reports: count \(FS\), % issued at zero task progress \(@0\)\.
### I\.1 Conditional Report Rates
Table [14](https://arxiv.org/html/2605.08747#A9.T14) reports conditional stop and report rates for the 10 anchor models\. $P(\text{rep}\mid W=0)$ is the probability that the model issues a report when the world condition is *not* satisfied; high values indicate a stronger tendency to terminate under world failure, which reflects indiscriminate reporting when paired with high FR\. $P(\neg\text{rep}\mid W=1)$ is the probability that the model fails to report when the world condition *is* satisfied; high values indicate missed closure\. Note that these conditional *rates* are recoverable from stop timing alone: they measure whether the agent terminated, not what it said\. Report *content* becomes essential for a different reason: determining whether the agent’s stated judgment matches the hidden world state \(e\.g\., `open` vs\. `closed` for state\-verification tasks\), which is the basis for the FR/NR decomposition \(Figure [3](https://arxiv.org/html/2605.08747#S4.F3)\) and the counterfactual analysis in Table [12](https://arxiv.org/html/2605.08747#A8.T12)\.
FR\-heavy models \(Claude\-Sonnet\-4 \[[35](https://arxiv.org/html/2605.08747#bib.bib35)\], GPT\-5\.4 \[[34](https://arxiv.org/html/2605.08747#bib.bib34)\]\) report in \>85% of $W=0$ episodes, whereas NR\-heavy models \(InternVL3\.5\-38B \[[39](https://arxiv.org/html/2605.08747#bib.bib39)\], Qwen3\-VL\-32B \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\]\) miss closure in \>40% of $W=1$ episodes\. These opposite conditional profiles overlap in aggregate scalar success \($B \approx$ 17–25%\)\.
Table 14: Conditional report rates \(10 anchor models\)\. Stop% = fraction of episodes where a `report` is issued\. $P(\text{rep}\mid W=0)$ = report rate conditioned on world failure\. $P(\neg\text{rep}\mid W=1)$ = missed\-closure rate conditioned on world success\. The $W=0$ and $W=1$ columns show the conditioning set sizes\.
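Because the conditional rates depend only on whether a report was issued, they can be sketched from pairs of booleans per episode. The function below is a minimal illustration; the tuple layout `(world_ok, reported)` and the returned key names are assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def conditional_report_rates(episodes: Sequence[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute stop and conditional report rates from (world_ok, reported) pairs.

    Returns fractions in [0, 1]; a conditional rate is NaN when its conditioning
    set is empty.
    """
    w0 = [reported for world_ok, reported in episodes if not world_ok]
    w1 = [reported for world_ok, reported in episodes if world_ok]
    nan = float("nan")
    return {
        "stop": sum(r for _, r in episodes) / len(episodes),
        "p_rep_given_w0": sum(w0) / len(w0) if w0 else nan,
        "p_norep_given_w1": sum(not r for r in w1) / len(w1) if w1 else nan,
        "n_w0": len(w0),
        "n_w1": len(w1),
    }


# Toy data: three world-failure episodes (two with a report) and two
# world-success episodes (one closed without any report).
toy = [(False, True), (False, True), (False, False), (True, True), (True, False)]
rates = conditional_report_rates(toy)
print(rates)
```

A high `p_rep_given_w0` corresponds to the FR-heavy profile and a high `p_norep_given_w1` to the NR-heavy profile described above.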
## Appendix J Action\-Feedback Intervention Details
The minimal action\-feedback intervention \(§[4\.3](https://arxiv.org/html/2605.08747#S4.SS3), Table [3](https://arxiv.org/html/2605.08747#S4.T3)\) adds two boolean signals after each action, alongside the standard RGB frame and dialogue history: `too_far` \(interaction attempted beyond the 1\.5 m proximity threshold\) and `path_blocked` \(navigation blocked by an obstacle\)\. These model the proprioceptive and low\-level controller feedback available to physical robots\. The `retry` columns in Table [3](https://arxiv.org/html/2605.08747#S4.T3) report mean per\-episode counts of redundant `interact_pixel` steps that repeat an identical invocation \(every repeat after the first occurrence of each invocation signature increments the episode total\)\. This intervention is outside the default benchmark contract; its purpose is to test whether exposing execution\-level action\-outcome signals changes reporting behavior in paired runs, not to define a second benchmark setting\.
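The retry counting rule described above can be sketched as follows. The action-log format is an assumption: each step is represented as an `(action_name, args)` pair with hashable arguments, and the invocation signature is taken to be that pair. The actual signature used by the evaluator is not specified in the text.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

Action = Tuple[str, Hashable]  # hypothetical (action_name, args) log entry


def retry_count(actions: Sequence[Action]) -> int:
    """Count redundant interact_pixel steps within one episode.

    Every repeat after the first occurrence of an identical invocation
    signature adds one to the episode total.
    """
    seen: Counter = Counter()
    retries = 0
    for name, args in actions:
        if name != "interact_pixel":
            continue  # only pixel-grounded interactions are counted
        signature = (name, args)
        if seen[signature] > 0:
            retries += 1
        seen[signature] += 1
    return retries


episode_log = [
    ("interact_pixel", (320, 240, "open")),
    ("interact_pixel", (320, 240, "open")),  # first repeat
    ("move_ahead", ()),
    ("interact_pixel", (320, 240, "open")),  # second repeat
    ("interact_pixel", (100, 50, "open")),   # new signature, not a retry
]
print(retry_count(episode_log))  # 2
```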
## Appendix K Robustness and Reproducibility Checks
This section collects checks that support the stability of the reported closure profiles without expanding the primary comparison panel: a neutral\-report prompt variant and paired reruns over the same 20 open\-weight configurations for which paired robustness runs were available\.
### K\.1 Run Metadata and Determinism
#### Prompt policy\.
All runs use prompt policy `native_embodied_public_evidence` \(version strings identify the template family and allowed `report` label vocabulary\)\. Each benchmark run records the prompt\-policy SHA\-256 hash, profile name, rendered\-prompt hash, pack hash, and repository commit, enabling bit\-exact audit of the evaluation contract\.
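A run-metadata record of this kind can be sketched with Python's standard `hashlib`. The record layout, the `run_metadata` helper, and the choice to hash the pack as a sorted list of episode IDs are assumptions for illustration; only the listed fields (policy hash, profile name, rendered-prompt hash, pack hash, commit) come from the text.

```python
import hashlib
import json
from typing import Dict, Sequence


def sha256_hex(text: str) -> str:
    """SHA-256 digest of a UTF-8 string, as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_metadata(
    policy_text: str,
    rendered_prompt: str,
    episode_ids: Sequence[str],
    profile: str,
    commit: str,
) -> Dict[str, str]:
    """Build a hypothetical audit record for one benchmark run."""
    # Sorting and canonical JSON keep the pack hash independent of input order.
    pack_blob = json.dumps(sorted(episode_ids), separators=(",", ":"))
    return {
        "prompt_policy_sha256": sha256_hex(policy_text),
        "rendered_prompt_sha256": sha256_hex(rendered_prompt),
        "pack_sha256": sha256_hex(pack_blob),
        "profile": profile,
        "commit": commit,
    }


meta = run_metadata(
    policy_text="placeholder policy template",
    rendered_prompt="placeholder rendered prompt",
    episode_ids=["ep_b", "ep_a"],
    profile="pure_rgb_dialogue_history_baseline",
    commit="<commit-sha>",
)
print(json.dumps(meta, indent=2))
```

Two runs can then be compared field by field: any difference in the policy, rendered prompt, or pack contents changes the corresponding digest.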
#### Frozen pack and run randomness\.
All models are evaluated on the same frozen episode pack `mixed_mainline_manual_first_balanced_1000_v1`; episode IDs and success specifications are fixed\. Any “seed” in run logs affects only job scheduling or tie\-breaking in the inference stack, not which episodes enter the aggregate\.
#### Evaluation profile\.
The default profile is `pure_rgb_dialogue_history_baseline`: single\-frame egocentric RGB, 20\-turn text\-only dialogue history, no depth, no pose, no visual history frames, and no agent\-side memory\.
#### Infrastructure\.
Models are served via vLLM with default parameters; no model\-specific hyperparameter tuning is applied\. Episode order within each pack run is deterministic \(sorted by episode ID\)\. Specific closed\-source model identifiers are listed in the footnotes of Table [2](https://arxiv.org/html/2605.08747#S4.T2)\.
### K\.2 Prompt Sensitivity Analysis
The default system prompt includes two anti\-hallucination instructions in the task block \(Appendix [A](https://arxiv.org/html/2605.08747#A1)\): *“Mere visibility, proximity, or an attempted action is not enough; do not report success from visual plausibility alone”* and *“Do not claim completion from visual plausibility alone when the task requires changing world state\.”* To test whether closure\-failure profiles are artifacts of these instructions, we run a neutral\-report prompt variant that removes both lines while keeping the rest of the evaluation contract identical: same frozen episode pack, same profile \(`pure_rgb_dialogue_history_baseline`\), same scoring rules\. Table [15](https://arxiv.org/html/2605.08747#A11.T15) reports per\-family W/B for all 20 open\-weight configurations for which paired neutral\-prompt runs were available\. Closed\-source models are omitted because paired neutral\-prompt runs were not available\.
Table 15: Prompt sensitivity: neutral\-report variant \(20 open\-weight configurations\)\. Same evaluation contract as the main analysis except that both anti\-hallucination instructions are removed from the system prompt\. Each cell reports W/B \(%\)\. Diagnostic: PG = pixel grounding, DA = distance approach, VS = view search, SV = state verification\. Compositional: AI = approach\-and\-interact, SI = search\-and\-interact, SM = sequential manipulation, CR = constraint resolving\. Configurations sorted by aggregate B\.
Across all 20 configurations, aggregate W shifts by at most 3\.5 pp and aggregate B by at most 4\.8 pp relative to the corresponding default\-prompt runs\. NR\-heavy models remain NR\-heavy \(InternVL3\.5\-38B \[[39](https://arxiv.org/html/2605.08747#bib.bib39)\]: 68\.2% NR vs\. 66\.1% under the default prompt\) and low\-$\Delta$ models remain low\-$\Delta$ \(Qwen3\-VL\-8B \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\]: $\Delta = 1.3$ vs\. 1\.5 pp\)\. For the paired open\-weight configurations, the main closure\-failure profiles are therefore unlikely to be artifacts of the specific anti\-hallucination instructions in the default prompt\.
### K\.3 Repeated\-Run Robustness for Open\-Weight Models
To check whether the open\-weight conclusions depend on a single serving run, we repeat the default no\-feedback evaluation for the same 20 open\-weight configurations used in the prompt\-sensitivity check on the same frozen 1,000\-episode pack and the same `pure_rgb_dialogue_history_baseline` profile\. Table [16](https://arxiv.org/html/2605.08747#A11.T16) reports the mean and half\-range across the two runs\. This robustness panel is intentionally broader than the primary 20\-model comparison panel in some directions and narrower in others: it includes low\-compliance and supplementary size/serving\-mode variants, while closed\-source API models and the largest open\-weight runs are kept as single\-run evaluations because paired reruns were not available under the same resource budget\.
Table 16: Repeated\-run robustness for 20 open\-weight configurations\. Each cell reports mean $\pm$ half\-range over two identical\-contract runs on the same frozen episode pack\. W = world\-state completion, B = benchmark success, $\Delta = W - B$, FR = false\-report rate, and NR = no\-report rate; all values are percentages\.
The paired runs preserve the qualitative profile assignments used in the main analysis\. Aggregate B is highly stable for most configurations: 19 of 20 have a half\-range below 1 pp, and the largest half\-range is Qwen3\-VL\-32B\(T\) \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\] at 2\.7 pp\. The most important closure regimes are also stable: InternVL3\.5\-38B \[[39](https://arxiv.org/html/2605.08747#bib.bib39)\] remains strongly NR\-heavy \(NR = 66\.2 $\pm$ 0\.0\), Qwen3\-VL\-32B \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\] remains high\-$\Delta$ and NR\-heavy \($\Delta = 19.6 \pm 0.7$, NR = 45\.0 $\pm$ 0\.4\), and low\-$\Delta$ models such as Qwen3\-VL\-8B \[[38](https://arxiv.org/html/2605.08747#bib.bib38)\] remain low\-gap across both runs\. The repeated\-run panel should therefore be read as a robustness check over open\-weight configurations, not as a replacement for the primary 20\-model comparison panel\.
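For two runs, the mean and half-range reduce to simple arithmetic. The sketch below is illustrative: the helper name is hypothetical, and the two input values are chosen only so that they reproduce a reported cell (19.6 ± 0.7); they are not stated as the actual per-run values.

```python
from typing import Tuple


def mean_half_range(run_a: float, run_b: float) -> Tuple[float, float]:
    """Mean and half-range of a metric measured in two identical-contract runs."""
    return (run_a + run_b) / 2.0, abs(run_a - run_b) / 2.0


# Illustrative per-run gaps (pp) consistent with a reported cell of 19.6 +/- 0.7.
m, h = mean_half_range(20.3, 18.9)
print(f"{m:.1f} +/- {h:.1f}")
```

With only two runs, the half-range equals the absolute deviation of either run from the mean, so it should be read as a spread indicator rather than a confidence interval.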
## Appendix L Qualitative Trajectory Examples
To complement the aggregate results, we visualize two representative trajectories under the same native\-control contract used in the main evaluation\. The examples highlight the two most interpretable failure surfaces: a single\-frame state\-verification judgment and a short approach–interaction episode\.
### L\.1 State Verification: Same Frame, Opposite Reports
The SV episode `procthor_microwave_seed298` asks the agent to observe a microwave and report whether it is *open* or *closed*\. The microwave is in fact closed; the ground\-truth expected report is `closed`\. Because the target is already in view and no physical action is required, the episode reduces to a pure perceptual judgment followed by a terminal report\.
Figure 5: State\-verification trajectory example\. The same initial observation can lead to correct closure \(Gemini\-3\.1\-Pro \[[32](https://arxiv.org/html/2605.08747#bib.bib32)\] and GPT\-5\.4 \[[34](https://arxiv.org/html/2605.08747#bib.bib34)\]\) or a false report \(Doubao\-Seed\-1\.8 \[[33](https://arxiv.org/html/2605.08747#bib.bib33)\]\)\. Claude\-Sonnet\-4 \[[35](https://arxiv.org/html/2605.08747#bib.bib35)\] moves away from the initially visible microwave before reporting, illustrating that even atomic verification probes can fail through unnecessary action followed by an incorrect terminal judgment\.
### L\.2 Grounded Interaction: Hallucinated Success
The AI episode `procthor_fridge_seed314` requires the agent to approach and open a refrigerator\. The target is visible from the start, but success requires approaching within interaction range, executing the correct pixel\-grounded interaction, and reporting only after the world state changes\.
Figure 6: Approach\-and\-interact trajectory example\. Gemini\-3\.1\-Pro \[[32](https://arxiv.org/html/2605.08747#bib.bib32)\] opens the refrigerator and reports success after the state change\. The other models submit terminal reports inconsistent with the final world state, either after failed interaction attempts or after claiming the wrong state\.