
# Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
Source: [https://arxiv.org/html/2605.12620](https://arxiv.org/html/2605.12620)
Nishad Singhi, Christian Bialas, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki, Marcus Rohrbach, Anna Rohrbach
Technical University of Darmstadt & hessian.AI
nishad.singhi@tu-darmstadt.de
[Webpage](https://nishadsinghi.github.io/vegas) | [Code](https://github.com/nishadsinghi/vegas)

###### Abstract

Building generalist embodied agents capable of solving complex real-world tasks remains a fundamental challenge in AI. Multimodal Large Language Models (MLLMs) have significantly advanced the reasoning capabilities of such agents through strong vision-language knowledge and chain-of-thought (CoT) reasoning, yet remain brittle when faced with challenging out-of-distribution scenarios. To address this, we propose Verifier-Guided Action Selection (VeGAS), a test-time framework designed to improve the robustness of MLLM-based embodied agents through an explicit verification step. At inference time, rather than committing to a single decoded action, VeGAS samples an ensemble of candidate actions and uses a generative verifier to identify the most reliable choice, without modifying the underlying policy. Crucially, we find that using an MLLM off-the-shelf as a verifier yields no improvement, motivating our LLM-driven data synthesis strategy, which automatically constructs a diverse curriculum of failure cases to expose the verifier to a rich distribution of potential errors at training time. Across embodied reasoning benchmarks spanning the Habitat and ALFRED environments, VeGAS consistently improves generalization, achieving up to a 36% relative performance gain over strong CoT baselines on the most challenging multi-object, long-horizon tasks.

## 1 Introduction

![Figure 1](https://arxiv.org/html/2605.12620v1/x1.png)

Figure 1: Overview of Verifier-Guided Action Selection (VeGAS). Given a task instruction (e.g., "Find a sports object and place it on the counter"), standard policies decode a single action that may be incorrect under distribution shifts (right). VeGAS, instead, samples multiple candidate actions with reasoning traces, evaluates them using a generative verifier, and executes only the highest-scored action (bottom). This test-time verification strategy substantially improves robustness in challenging out-of-distribution scenarios involving long-horizon tasks.

A longstanding goal in AI is to create embodied agents that operate autonomously in physical environments to accomplish complex tasks specified through natural language [[8](https://arxiv.org/html/2605.12620#bib.bib39), [22](https://arxiv.org/html/2605.12620#bib.bib38)], such as navigating to target locations [[31](https://arxiv.org/html/2605.12620#bib.bib11)] and manipulating everyday objects [[39](https://arxiv.org/html/2605.12620#bib.bib13), [36](https://arxiv.org/html/2605.12620#bib.bib18)]. Recently, Multimodal Large Language Models (MLLMs), pretrained on Internet-scale vision-language data, have emerged as a promising foundation for building such agents, owing to their strong perceptual and linguistic generalization [[50](https://arxiv.org/html/2605.12620#bib.bib10), [15](https://arxiv.org/html/2605.12620#bib.bib40), [45](https://arxiv.org/html/2605.12620#bib.bib2)]. While early efforts relied on the zero-shot capabilities of MLLMs [[15](https://arxiv.org/html/2605.12620#bib.bib40), [32](https://arxiv.org/html/2605.12620#bib.bib3), [1](https://arxiv.org/html/2605.12620#bib.bib41)], finetuning on embodied data, either via supervised learning [[51](https://arxiv.org/html/2605.12620#bib.bib42)] or reinforcement learning [[45](https://arxiv.org/html/2605.12620#bib.bib2), [44](https://arxiv.org/html/2605.12620#bib.bib34), [54](https://arxiv.org/html/2605.12620#bib.bib43)], yields significant improvements. More recently, incorporating Chain-of-Thought (CoT) reasoning has further enhanced decision-making by enabling agents to reason step-by-step before acting [[48](https://arxiv.org/html/2605.12620#bib.bib26), [28](https://arxiv.org/html/2605.12620#bib.bib14), [53](https://arxiv.org/html/2605.12620#bib.bib25), [54](https://arxiv.org/html/2605.12620#bib.bib43)]. Despite this progress, MLLM-based embodied agents remain brittle in out-of-distribution scenarios and long-horizon settings [[50](https://arxiv.org/html/2605.12620#bib.bib10)]. For instance, an agent might reliably execute "bring me a banana" but fail when the same goal is phrased as "bring me a yellow curved fruit." Similarly, an agent trained on single-object pick-and-place may fail on a multi-step task such as cleaning an apple and placing it in a cabinet.

We observe that a key factor underlying these failures is that agents cannot recognize mistakes in their reasoning process and correct them at test time. In particular, they commit to a single greedily decoded action at each step with no opportunity for self-correction. In contrast, humans routinely consider multiple candidate actions, mentally evaluate their likely outcomes, and commit only to the most promising one, effectively performing verification before acting. This idea has a direct computational analogue: recent work on scaling test-time compute shows that sampling multiple candidate solutions and selecting the best one via a learned verifier substantially improves LLM performance in domains such as coding and mathematics [[5](https://arxiv.org/html/2605.12620#bib.bib28), [4](https://arxiv.org/html/2605.12620#bib.bib53), [38](https://arxiv.org/html/2605.12620#bib.bib54)]. However, extending verification to high-level embodied reasoning poses distinct challenges: unlike in mathematics or code, embodied agents operate under partial observability and must reason about semantic task progression from egocentric observations alone, with compounding errors in long-horizon plans. Yet verification for high-level embodied reasoning remains largely unexplored.

To bridge this gap, we introduce Verifier-Guided Action Selection (VeGAS), a framework that improves the robustness of MLLM-based embodied agents by incorporating an explicit verification step at test time. Concretely, at each timestep VeGAS samples multiple candidate actions from the policy, each accompanied by a Chain-of-Thought rationale. A learned generative verifier [[55](https://arxiv.org/html/2605.12620#bib.bib27), [2](https://arxiv.org/html/2605.12620#bib.bib30)] then evaluates each candidate by producing an explicit reasoning trace followed by a correctness judgement, and the agent executes only the highest-scoring action (see Figure [1](https://arxiv.org/html/2605.12620#S1.F1) for an overview). A critical finding is that using an MLLM as a verifier off-the-shelf yields no improvement over the base policy: general-purpose language understanding alone is insufficient for embodied verification. This motivates specialized verifier training; however, standard embodied datasets contain only successful demonstrations, providing no signal for what constitutes an incorrect action. To address this, we introduce an LLM-driven pipeline that automatically synthesizes diverse, realistic failure trajectories paired with verification annotations, constructing a rich curriculum of both correct and incorrect examples without additional human data collection.

VeGAS yields consistent improvements across embodied reasoning benchmarks in Habitat 2.0 [[43](https://arxiv.org/html/2605.12620#bib.bib45), [45](https://arxiv.org/html/2605.12620#bib.bib2)] and AI2-THOR [[34](https://arxiv.org/html/2605.12620#bib.bib35)], raising average success rates from 65% to 71% on LangR and from 44% to 49% on EB-ALFRED over strong CoT baselines, while also improving significantly larger off-the-shelf policies.

Our key contributions are:

1. We propose Verifier-Guided Action Selection (VeGAS), a test-time verification framework for high-level embodied reasoning that samples diverse candidate actions and uses a learned generative verifier to select the most reliable one at each timestep. We find that using MLLMs off-the-shelf as verifiers does not improve performance, motivating our specialized training pipeline.
2. Training a verifier requires demonstrations of both correct and incorrect actions, yet embodied datasets typically lack the latter. To address this, we introduce an automated, LLM-driven pipeline that synthesizes diverse and realistic failure trajectories paired with verification annotations, without additional human data collection.
3. Extensive experiments in Habitat 2.0 [[43](https://arxiv.org/html/2605.12620#bib.bib45), [45](https://arxiv.org/html/2605.12620#bib.bib2)] and AI2-THOR [[34](https://arxiv.org/html/2605.12620#bib.bib35)] show that VeGAS consistently improves over strong CoT baselines, scales more effectively with test-time compute than Self-Consistency [[47](https://arxiv.org/html/2605.12620#bib.bib56)], and generalizes to large, off-the-shelf policies.

## 2 Related Work

**Foundation Models for Embodied Agents.** Multimodal LLMs have proven useful for developing intelligent systems that perceive and interact with an environment [[10](https://arxiv.org/html/2605.12620#bib.bib9), [50](https://arxiv.org/html/2605.12620#bib.bib10), [20](https://arxiv.org/html/2605.12620#bib.bib12)]. They have shown strong generalization skills in areas such as language-guided navigation [[31](https://arxiv.org/html/2605.12620#bib.bib11), [9](https://arxiv.org/html/2605.12620#bib.bib21), [33](https://arxiv.org/html/2605.12620#bib.bib22), [24](https://arxiv.org/html/2605.12620#bib.bib23)], task planning [[39](https://arxiv.org/html/2605.12620#bib.bib13), [36](https://arxiv.org/html/2605.12620#bib.bib18), [40](https://arxiv.org/html/2605.12620#bib.bib19), [15](https://arxiv.org/html/2605.12620#bib.bib40)], and embodied decision making [[21](https://arxiv.org/html/2605.12620#bib.bib16), [32](https://arxiv.org/html/2605.12620#bib.bib3), [45](https://arxiv.org/html/2605.12620#bib.bib2), [49](https://arxiv.org/html/2605.12620#bib.bib17), [11](https://arxiv.org/html/2605.12620#bib.bib20)]. While such approaches have mostly used off-the-shelf or fine-tuned LLMs/VLMs, recent works show that Chain-of-Thought (CoT) [[48](https://arxiv.org/html/2605.12620#bib.bib26)] can further improve performance via multi-modal reasoning [[28](https://arxiv.org/html/2605.12620#bib.bib14), [53](https://arxiv.org/html/2605.12620#bib.bib25)], sub-goal consistency [[26](https://arxiv.org/html/2605.12620#bib.bib15)], or spatial reasoning [[42](https://arxiv.org/html/2605.12620#bib.bib24)]. Some works have explored improving multi-agent embodied cooperation through coordinated planning [[25](https://arxiv.org/html/2605.12620#bib.bib66)] and tree-search-based collaborative deliberation [[57](https://arxiv.org/html/2605.12620#bib.bib67)]; while these methods orchestrate multiple off-the-shelf LLM agents via structured communication and joint plan search, our work targets single-agent action reliability through a dedicated verifier trained on synthetically generated failure data.

**Verifiers.** Verification has recently emerged as a key strategy for improving LLM reasoning by evaluating and selecting among candidate solutions. Early work trained separate verifiers to score solutions between 0 and 1, selecting the solution with the highest score as the final answer (i.e., Best-of-N) [[5](https://arxiv.org/html/2605.12620#bib.bib28), [52](https://arxiv.org/html/2605.12620#bib.bib29)]. Recent work has shown the advantages of generative verifiers that produce verification rationales (i.e., critiques/corrections), consistently outperforming discriminative verifiers while also enhancing explainability [[55](https://arxiv.org/html/2605.12620#bib.bib27), [2](https://arxiv.org/html/2605.12620#bib.bib30), [37](https://arxiv.org/html/2605.12620#bib.bib31)]. In multimodal settings, vision-language reward models extend verification to visual outcomes [[55](https://arxiv.org/html/2605.12620#bib.bib27), [41](https://arxiv.org/html/2605.12620#bib.bib32)], and discriminative verifiers have been applied to low-level control via VLA models [[18](https://arxiv.org/html/2605.12620#bib.bib33)]. In contrast, our work is the first to apply generative verifiers to high-level embodied reasoning, with an emphasis on challenging scenarios requiring novel behaviors and robustness to linguistic variations. Concurrently, [[13](https://arxiv.org/html/2605.12620#bib.bib68)] propose generating and scoring multiple candidate actions along with test-time training for embodied agents.

**Embodied Agent Benchmarks.** Several simulation platforms support embodied AI research, including AI2-THOR [[17](https://arxiv.org/html/2605.12620#bib.bib36), [7](https://arxiv.org/html/2605.12620#bib.bib47), [6](https://arxiv.org/html/2605.12620#bib.bib49), [12](https://arxiv.org/html/2605.12620#bib.bib48)] and Habitat [[27](https://arxiv.org/html/2605.12620#bib.bib46), [43](https://arxiv.org/html/2605.12620#bib.bib45), [30](https://arxiv.org/html/2605.12620#bib.bib44)], with benchmarks spanning diverse task complexities [[34](https://arxiv.org/html/2605.12620#bib.bib35), [29](https://arxiv.org/html/2605.12620#bib.bib50), [16](https://arxiv.org/html/2605.12620#bib.bib51), [19](https://arxiv.org/html/2605.12620#bib.bib52), [35](https://arxiv.org/html/2605.12620#bib.bib55)]. LangR [[45](https://arxiv.org/html/2605.12620#bib.bib2)], built on Habitat 2.0 [[43](https://arxiv.org/html/2605.12620#bib.bib45)], evaluates out-of-distribution generalization along two axes: *paraphrastic robustness* (e.g., "pick up a banana" → "pick up a yellow curved fruit") and *behavioral generalization* (e.g., extending single-object tasks to multi-object variants). ALFRED [[34](https://arxiv.org/html/2605.12620#bib.bib35)] contains 25K language-annotated household tasks with both high-level goals and low-level instructions across seven core task types, including pick-and-place, clean-and-place, and examine-in-light scenarios. TEACh [[29](https://arxiv.org/html/2605.12620#bib.bib50)] extends this with over 3,000 human-human dialogues for interactive task completion, ranging from "Make Coffee" to "Prepare Breakfast". More recently, EmbodiedBench [[50](https://arxiv.org/html/2605.12620#bib.bib10)] offers a comprehensive evaluation framework with 1,128 tasks across hierarchical action levels, from high-level planning to low-level motor control, assessing capabilities such as spatial awareness and long-horizon planning.

## 3 Preliminaries

We now formalize the embodied decision-making setup and the policy architecture that serve as the foundation for our approach.

**Problem Formulation.** We formulate the agent's task as a sequential decision-making problem under partial observability. The agent's objective is to generate a sequence of actions $a_1, \ldots, a_T$ to accomplish a goal specified as a natural language instruction $I$ (e.g., "Bring an item that can be used for cutting to the left counter"). At each timestep $t$, the agent receives an egocentric RGB image $o_t$ as its observation. The agent's true underlying state $s_t$ is not directly accessible. The agent must decide on its next action $a_t$ based on the goal $I$ and its history $h_t$, composed of all its past observations and actions $(o_1, a_1, \ldots, o_{t-1}, a_{t-1}, o_t)$. Our aim is to learn a policy $\pi$ that maps the goal and history to the next action: $\pi(a_t \mid I, o_{1:t}, a_{1:t-1})$. The action space $A$ consists of high-level semantic actions, such as `pick(apple)` and `navigate(table)`. Following prior work [[45](https://arxiv.org/html/2605.12620#bib.bib2), [50](https://arxiv.org/html/2605.12620#bib.bib10)], we assume an oracle low-level policy that executes these high-level actions once selected.
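To make the formulation concrete, the following minimal sketch shows the resulting decision loop. The `env` and `policy` objects and their methods (`reset`, `step`, `act`) are hypothetical interfaces for illustration, not the paper's code.

```python
from dataclasses import dataclass

@dataclass
class History:
    """Interaction history h_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t)."""
    observations: list  # egocentric RGB images o_1..o_t
    actions: list       # executed high-level actions a_1..a_{t-1}

def run_episode(env, policy, instruction, max_steps=50):
    """One episode of the partially observed decision loop described above."""
    obs = env.reset(instruction)                   # o_1; the state s_t stays hidden
    history = History(observations=[obs], actions=[])
    for _ in range(max_steps):
        action = policy.act(instruction, history)  # a_t ~ pi(. | I, o_1:t, a_1:t-1)
        obs, done = env.step(action)               # oracle low-level execution
        history.actions.append(action)
        history.observations.append(obs)
        if done:
            return True
    return False
```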

**Policy Architecture.** We instantiate the policy $\pi$ as a multimodal large language model (MLLM) that takes visual and text tokens as input and autoregressively generates text tokens as output. Given the goal in the form of a text instruction $I$ and the history $h_t$ as input, the policy autoregressively emits an output token sequence $y_t = (c_t, a_t)$. Here, $c_t$ is an optional chain-of-thought rationale (a possibly empty sequence of text tokens) followed by the action token sequence $a_t$. Following Szot et al. [[44](https://arxiv.org/html/2605.12620#bib.bib34)], the actions are encoded in natural language (e.g., `pick(apple)`), and the output can be parsed to extract the action $a_t$, which is sent to the environment to be executed by the low-level policy [[45](https://arxiv.org/html/2605.12620#bib.bib2), [50](https://arxiv.org/html/2605.12620#bib.bib10)].
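Since the qualitative examples later in the paper (Figures 4 and 5) show actions wrapped in `<action>...</action>` tags, extraction can be as simple as the sketch below; the tag format and the fallback rule are assumptions based on those examples.

```python
import re

def extract_action(output_text: str) -> str:
    """Split the decoded output y_t = (c_t, a_t): return the action token
    sequence a_t, treating everything before it as the CoT rationale c_t."""
    match = re.search(r"<action>\s*(.*?)\s*</action>", output_text, re.DOTALL)
    if match:
        return match.group(1)
    # Fallback assumption: without tags, the final line holds the action.
    return output_text.strip().splitlines()[-1]
```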

**Policy Training.** We train the policy via imitation learning on expert demonstrations $\mathcal{D} = \{\tau\}$, where each trajectory $\tau = \big(I, (o_1, a_1), \ldots, (o_T, a_T)\big)$ depicts a successful execution of the task. The model is fine-tuned via supervised next-token prediction to maximize the likelihood of the expert output $y_t = (c_t, a_t)$, computing the loss only over output tokens, including the CoT prefix $c_t$ when present.
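In Hugging-Face-style trainers, this "loss only over output tokens" is typically realized by masking prompt positions in the labels; a minimal sketch, assuming the prompt length is known per example:

```python
import torch

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Labels for supervised next-token prediction: positions belonging to the
    prompt (instruction, history, observation) are set to -100, the default
    ignore_index of cross-entropy, so only y_t = (c_t, a_t) contributes loss."""
    labels = input_ids.clone()
    labels[..., :prompt_len] = -100
    return labels
```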

## 4 VeGAS: Verifier-Guided Action Selection

![Figure 2](https://arxiv.org/html/2605.12620v1/x2.png)

Figure 2: Example of a synthetic mistake and verification generated using our pipeline on the ALFRED training dataset. Starting from a correct action (top; 'find a TennisRacket'), our method introduces a mistake (bottom) where the agent does not locate the racket before attempting to pick it. Our method also generates a corresponding verification explaining the mistake.

![Figure 3](https://arxiv.org/html/2605.12620v1/x3.png)

Figure 3: Synthetic data generation and training workflow for the verifier. Successful trajectories are first processed by an LLM to produce chain-of-thought rationales for each action. Then, an LLM introduces realistic and diverse errors into these trajectories and annotates every action with a verification. This dataset is used to train the verifier through supervised finetuning.

The core idea of VeGAS is to augment a base policy with a learned verifier that, at each timestep, evaluates candidate actions and identifies the most reliable one before execution. Because off-the-shelf MLLMs fail as verifiers (as we will show in Sec. [6.1](https://arxiv.org/html/2605.12620#S6.SS1)), we train a dedicated generative verifier on automatically synthesized failure data (Sec. [4.1](https://arxiv.org/html/2605.12620#S4.SS1)). Concretely, VeGAS operates via a Best-of-N procedure: at each timestep, the policy samples $N$ candidate actions, the generative verifier [[55](https://arxiv.org/html/2605.12620#bib.bib27), [2](https://arxiv.org/html/2605.12620#bib.bib30)] evaluates each one by producing a verification reasoning trace followed by a correctness judgement, and the highest-scoring action is executed. Unlike discriminative verifiers that directly output a score [[5](https://arxiv.org/html/2605.12620#bib.bib28), [52](https://arxiv.org/html/2605.12620#bib.bib29)], generative verifiers think step-by-step before assigning a score, which has been shown to yield stronger performance while also making the scores more interpretable [[55](https://arxiv.org/html/2605.12620#bib.bib27)].

### 4.1 Synthetic Reasoning and Verification Data

**CoT Augmentation.** We start with a dataset of successful ('+') trajectories, $\mathcal{D}^{+} = \{\tau^{+}\}$, where a trajectory $\tau^{+}$ consists of the instruction $I$ and interleaved observations $o$ and actions $a$: $\tau^{+} = \{I, o_1, a^{+}_1, o_2, a^{+}_2, \ldots\}$. A model trained on these trajectories will directly output the next action given the observation. However, research in language reasoning and embodied AI has shown that thinking step-by-step can significantly improve the reasoning abilities of models [[48](https://arxiv.org/html/2605.12620#bib.bib26), [28](https://arxiv.org/html/2605.12620#bib.bib14)]. To train embodied agents that can think step-by-step, similarly to Zawalski et al. [[53](https://arxiv.org/html/2605.12620#bib.bib25)], we prompt a teacher LLM (e.g., OpenAI o3) to augment every action with a chain-of-thought rationale $c^{+}_i$ explaining why the agent should perform the expected action $a^{+}_i$ given the previous inputs $I, o_1, a^{+}_1, \ldots, o_i$. This gives us a new dataset $\mathcal{D}^{+}_{CoT} = \{\tau^{+}_{CoT}\}$, with $\tau^{+}_{CoT} = \{I, o_1, (c^{+}_1, a^{+}_1), o_2, (c^{+}_2, a^{+}_2), \ldots\}$. Note that this procedure only augments every action $a^{+}_i$ with a chain-of-thought, and does not change the sequence of actions in the trajectories. Unlike Zawalski et al. [[53](https://arxiv.org/html/2605.12620#bib.bib25)], which grounds reasoning traces in visual features such as object and gripper positions for fine-grained manipulation, we target high-level semantic reasoning for tasks requiring long-horizon planning and linguistic interpretation (prompt in Appendix [10.1](https://arxiv.org/html/2605.12620#S10.SS1)).
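A minimal sketch of this augmentation step, assuming an OpenAI-style client; the prompt here is a simplified stand-in for the actual prompt given in Appendix 10.1.

```python
def augment_with_cot(client, instruction, steps):
    """steps: list of (observation_description, expert_action) pairs from a
    successful trajectory. Returns the same actions, each paired with a
    teacher-written rationale c_i^+; the action sequence is left unchanged."""
    augmented, context = [], [f"Instruction: {instruction}"]
    for obs, action in steps:
        context.append(f"Observation: {obs}")
        prompt = "\n".join(context) + (
            f"\nExplain step by step why the agent should now do: {action}")
        rationale = client.chat.completions.create(
            model="o3", messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        augmented.append((rationale, action))
        context.append(f"Action: {action}")
    return augmented
```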

**Synthetic Failures for Verifier Training.** To train a verifier, we require examples of both correct and incorrect actions. Because existing datasets rarely include failed executions, we introduce an automated and scalable pipeline that synthesizes unsuccessful trajectories. For each successful trajectory $\tau^{+}$, we prompt a large language model (e.g., OpenAI o3) to produce a corresponding failed trajectory $\tau^{-}$. The model generates realistic and diverse mistakes that span a broad range of failure modes in challenging scenarios, including *wrong object* (e.g., bringing an apple when the task requires a banana), *wrong receptacle* (e.g., placing an item on the sofa instead of the bed), and *precondition violation* (e.g., attempting to turn on a microwave without opening it first). We provide examples of synthetically generated incorrect actions in Figures [2](https://arxiv.org/html/2605.12620#S4.F2) and [3](https://arxiv.org/html/2605.12620#S4.F3) and Appendix [11](https://arxiv.org/html/2605.12620#S11). The exact prompts are available in Appendix [10.2](https://arxiv.org/html/2605.12620#S10.SS2).
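The sketch below illustrates this synthesis step under the same assumptions as before (OpenAI-style client, simplified stand-in prompt; the real prompt is in Appendix 10.2). The failure-mode list mirrors the categories named above.

```python
FAILURE_MODES = [
    "wrong object (e.g., bring an apple when the task requires a banana)",
    "wrong receptacle (e.g., place an item on the sofa instead of the bed)",
    "precondition violation (e.g., turn on a microwave before opening it)",
]

def synthesize_failure(client, successful_trajectory: str) -> str:
    """Rewrite a successful trajectory tau+ into a realistic failed
    trajectory tau- by injecting one mistake from the taxonomy above."""
    prompt = (
        "Introduce one realistic mistake into the trajectory below, chosen "
        f"from these failure modes: {'; '.join(FAILURE_MODES)}\n\n"
        f"{successful_trajectory}"
    )
    response = client.chat.completions.create(
        model="o3", messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```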

For both $\tau^{+}$ and $\tau^{-}$, we prompt the model to annotate every action with a verification consisting of chain-of-thought reasoning and a final binary judgement of the form `action_is_correct: yes/no`. These annotated positive and negative samples provide the supervision used to train the verifier.
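Put together, one verifier training example might look like the record below; the field names are illustrative, not the paper's exact schema. Positive examples from $\tau^{+}$ end in `action_is_correct: yes`, negatives from $\tau^{-}$ in `action_is_correct: no`.

```python
example = {
    "instruction": "Find a sports object and place it on the counter",
    "previous_actions": ["navigate(sink)"],      # a_1 .. a_{t-1}
    "observation": "frame_t.png",                # egocentric RGB image o_t
    "candidate_cot": "The sink area may hold the target; a sponge is visible...",
    "candidate_action": "pick_sponge(robot_0)",
    # Supervision target: verification CoT followed by the binary judgement.
    "verification": "The task asks for a SPORTS object; a sponge is not one, "
                    "so this action does not advance the goal. "
                    "action_is_correct: no",
}
```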

### 4.2 Verifier Training and Inference

We fine-tune an MLLM as a verifier that takes as input the instruction $I$, all previous actions $a_1, a_2, \ldots, a_{t-1}$, the current observation $o_t$, and the chain-of-thought $c_t$ and action $a_t$ sampled from the policy. It outputs a verification $v_t$ consisting of a chain-of-thought and a verdict. The verifier is trained via supervised finetuning on the data described in Sec. [4.1](https://arxiv.org/html/2605.12620#S4.SS1) using the same next-token prediction objective as the policy (Sec. [3](https://arxiv.org/html/2605.12620#S3)). This process is illustrated in Figure [3](https://arxiv.org/html/2605.12620#S4.F3).

During inference, at time $t$, we sample $N$ candidate actions from the policy: $(c^{(1)}_t, a^{(1)}_t), (c^{(2)}_t, a^{(2)}_t), \ldots, (c^{(N)}_t, a^{(N)}_t)$. We pass each candidate $(c^{(n)}_t, a^{(n)}_t)$ to the verifier and, following the original GenRM procedure [[55](https://arxiv.org/html/2605.12620#bib.bib27)], sample $M$ verifications per action to reduce variance, each consisting of a verification chain-of-thought and a verdict. The verdict can be mapped to a score ('yes' → 1 and 'no' → 0), giving us $M$ scores per action. We average these scores to obtain a final score for every action, $\sigma^{(n)}_t$. Finally, we select the highest-scoring action (Best-of-N): $a_t = \operatorname{argmax}_{n \in [N]} \sigma^{(n)}_t$. The selected action $a_t$ is then executed in the environment, and the process repeats at the next timestep.
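The full selection rule then reads as the following sketch. `policy.sample` and `verifier.sample` are hypothetical interfaces returning sampled text; the verdict parsing assumes the `action_is_correct: yes/no` format from Sec. 4.1.

```python
import re

def vegas_select(policy, verifier, instruction, history, obs, N=16, M=5):
    """Best-of-N action selection: sample N candidates, score each with M
    sampled verifications ('yes' -> 1, 'no' -> 0), average, take the argmax."""
    candidates = [policy.sample(instruction, history, obs, temperature=0.7)
                  for _ in range(N)]                     # each: (c_t, a_t)
    scores = []
    for cot, action in candidates:
        verdicts = [verifier.sample(instruction, history, obs, cot, action,
                                    temperature=0.7) for _ in range(M)]
        scores.append(sum(
            1.0 if re.search(r"action_is_correct:\s*yes", v) else 0.0
            for v in verdicts) / M)                      # sigma_t^(n)
    best = max(range(N), key=lambda n: scores[n])        # argmax over candidates
    return candidates[best][1]                           # the action a_t to execute
```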

## 5 Experimental Setup

### 5.1 Benchmark Details

We evaluate our approach on two embodied AI benchmarks targeting out-of-distribution generalization: LangR [[45](https://arxiv.org/html/2605.12620#bib.bib2)] in the Habitat 2.0 simulator [[43](https://arxiv.org/html/2605.12620#bib.bib45)], and ALFRED [[34](https://arxiv.org/html/2605.12620#bib.bib35)] in the AI2-THOR simulator [[17](https://arxiv.org/html/2605.12620#bib.bib36)]. In both benchmarks, the agent is placed in previously unseen indoor environments and tasked with completing multi-step household tasks (e.g., rearranging objects, examining items) specified through natural language. At each timestep, the agent receives an egocentric RGB observation and selects a high-level semantic action (e.g., `navigate(table)`, `pick(apple)`, `open(fridge)`), which is executed by the simulator.

**LangR [[45](https://arxiv.org/html/2605.12620#bib.bib2)].** This benchmark comprises a diverse set of training tasks featuring multiple paraphrastic variations and interactions with different household objects. The benchmark also includes several out-of-distribution (OOD) tasks designed to evaluate the model's generalization capabilities. These evaluation tasks differ from the training set in their natural language instructions, which vary either through linguistic reformulation (termed *Paraphrastic Robustness*, e.g., "pick up a banana" → "pick up a yellow curved fruit") or through changes in the underlying task structure (termed *Behavioral Generalization*, e.g., "move an apple and a banana" → "move an apple, a banana, and a ball"). The evaluation suite comprises 8 tasks with 100 instructions each. Further details are available in Appendix [8](https://arxiv.org/html/2605.12620#S8).

**ALFRED [[34](https://arxiv.org/html/2605.12620#bib.bib35)].** This benchmark, built upon the AI2-THOR simulator [[17](https://arxiv.org/html/2605.12620#bib.bib36)], encompasses seven distinct task types (such as *pick and place* and *examine in light*) that involve diverse interactions with objects in household environments. In this work, we employ the EB-ALFRED implementation introduced by Yang et al. [[50](https://arxiv.org/html/2605.12620#bib.bib10)], which reorganizes tasks from the original benchmark into several categories designed to assess different aspects of OOD generalization, including *long-horizon tasks*, *common-sense reasoning*, and *spatial understanding*. The evaluation suite comprises 6 tasks with 50 instructions each. Further details are in Appendix [8](https://arxiv.org/html/2605.12620#S8).

### 5.2 Policy and Verifier

**Policy Training.** LangR does not include expert demonstrations for training. To construct its training dataset $\mathcal{D}^{+}$, we execute an RL-trained policy from Szot et al. [[45](https://arxiv.org/html/2605.12620#bib.bib2)] on the LangR training split, collecting 10K trajectories, each comprising an instruction, observations, and actions. Trajectories in which the agent failed to complete the task are discarded (fewer than 3% of the total). For EB-ALFRED, we use the instructions, observations, and actions from the training data provided by the original ALFRED benchmark [[34](https://arxiv.org/html/2605.12620#bib.bib35)], which contains approximately 6.5K expert demonstrations. For both benchmarks, we prompt OpenAI's o3 model to augment every action in the expert trajectories with a chain-of-thought (see prompts in Appendix [10.1](https://arxiv.org/html/2605.12620#S10.SS1)), yielding $\mathcal{D}^{+}_{CoT}$ (as described in Sec. [4.1](https://arxiv.org/html/2605.12620#S4.SS1)). We fine-tune Qwen2.5-VL-3B-Instruct [[3](https://arxiv.org/html/2605.12620#bib.bib58)] on $\mathcal{D}^{+}_{CoT}$ to obtain the chain-of-thought (CoT) policy. Additional implementation and hyperparameter details are provided in Appendix [9](https://arxiv.org/html/2605.12620#S9).

**Verifier Training.** We begin with approximately 4.5K and 6.5K successful trajectories from LangR and ALFRED, respectively. To generate negative samples, we prompt OpenAI's o3 model to synthesize one failed trajectory corresponding to each successful example, and to provide verification annotations for every action within the failed trajectory (see prompts in Appendix [10.2](https://arxiv.org/html/2605.12620#S10.SS2)). This process yields a corpus of failed trajectories, denoted $\mathcal{D}^{-}_{CoT}$. The same model is further used to annotate each action in the successful trajectories with corresponding verifications. We combine the verifications from both successful and failed trajectories, and randomly sample from this pool to construct a balanced dataset containing equal numbers of correct and incorrect samples. To ensure a fair comparison, we use the same base model, Qwen2.5-VL-3B-Instruct [[3](https://arxiv.org/html/2605.12620#bib.bib58)], for both the policy and the verifier, differing only in their training data. Additional implementation details are provided in Appendix [9](https://arxiv.org/html/2605.12620#S9).

**Inference.** We use vLLM [[23](https://arxiv.org/html/2605.12620#bib.bib59)] to perform inference with the policy and verifier. For Habitat 2.0, we run experiments on NVIDIA L40 GPUs; for ALFRED, on NVIDIA A100 80GB GPUs. For the No-CoT and CoT policies, we sample the model responses via greedy decoding. When sampling multiple candidate actions, we sample actions and verifications with a temperature of 0.7. For VeGAS, we sample $N=16$ candidate actions and $M=5$ verifications per action at every timestep. We report results and comparisons against baselines in Sec. [6](https://arxiv.org/html/2605.12620#S6).
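For concreteness, the decoding setup above might be configured in vLLM as in the sketch below; the prompt construction and the multimodal image inputs are elided, and the checkpoint name is the base model rather than our finetuned one.

```python
from vllm import LLM, SamplingParams

policy = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
greedy = SamplingParams(temperature=0.0, max_tokens=512)         # No-CoT / CoT baselines
diverse = SamplingParams(temperature=0.7, n=16, max_tokens=512)  # N = 16 candidates

prompt = "..."  # serialized instruction I, history h_t, and observation o_t
candidates = [out.text for out in policy.generate([prompt], diverse)[0].outputs]
```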

## 6 Experiments

Table 1: Success rates on the LangR [[45](https://arxiv.org/html/2605.12620#bib.bib2)] benchmark. The No-CoT, w/ CoT, and VeGAS models are based on Qwen2.5-VL-3B-Instruct. ZS and FT refer to zero-shot and finetuning, respectively. For verifier-based approaches, results are averaged over three runs. The No-CoT and CoT variants use greedy decoding for action selection, yielding deterministic outcomes. Our proposed approach, which combines chain-of-thought reasoning with a finetuned verifier (VeGAS), consistently outperforms all baselines.

First, we evaluate the impact of zero-shot and finetuned verifiers on out-of-distribution generalization (Sec. [6.1](https://arxiv.org/html/2605.12620#S6.SS1)). Next, we examine whether our finetuned verifier can improve larger, off-the-shelf policies it was never trained with (Sec. [6.1](https://arxiv.org/html/2605.12620#S6.SS1)). Finally, we present ablation studies examining training pipeline choices, test-time compute scaling, and latency (Sec. [6.2](https://arxiv.org/html/2605.12620#S6.SS2)).

### 6.1 Verifiers Improve Generalization, But Only When Finetuned

We analyze the effect of verification on the LangR benchmark (Table [1](https://arxiv.org/html/2605.12620#S6.T1)). Fine-tuning with CoT supervision achieves a 65% average success rate, surpassing the prior state of the art, SemLang [[44](https://arxiv.org/html/2605.12620#bib.bib34)], and establishing a strong baseline.

We then evaluate whether verification can further improve the CoT policy. Using the same Qwen2.5-VL-3B-Instruct model as a zero-shot verifier (+ ZS Verifier) does not meaningfully improve over the CoT baseline and, in fact, slightly hurts average performance (64% vs. 65%). This shows that the Best-of-N selection paradigm alone is insufficient; without task-specific training, the verifier cannot reliably distinguish correct actions from incorrect ones. In contrast, equipping the CoT policy with our finetuned verifier (VeGAS) raises performance to 71%, with consistent gains across all task categories. The improvements are particularly pronounced in challenging scenarios such as *Multiple Objects*, where VeGAS provides roughly a 36% relative improvement over CoT alone and doubles performance compared to No-CoT. These results demonstrate that the gains of VeGAS stem not from sampling multiple candidates per se, but from our synthetic failure generation and verifier training pipeline. An example in Figure [4](https://arxiv.org/html/2605.12620#S6.F4) highlights the verifier's effectiveness: it correctly flags an action arising from the agent misunderstanding the instruction.

To test whether these findings generalize beyond LangR, we repeat the evaluation on EB-ALFRED (Table [2](https://arxiv.org/html/2605.12620#S6.T2)). Our CoT policy, obtained by fine-tuning Qwen2.5-VL-3B-Instruct, achieves an average success rate of 44%. This surpasses Qwen2.5-VL-72B-Instruct (success rate 30%), a ∼20× larger model and the best open-weights model reported on EB-ALFRED [[50](https://arxiv.org/html/2605.12620#bib.bib10)], thereby establishing a strong baseline. As on LangR, using the same model as a zero-shot verifier (+ ZS Verifier) does not improve over the CoT baseline (44% vs. 44%). In contrast, equipping the policy with our finetuned verifier (VeGAS) raises performance to 49%, confirming that task-specific verifier training is essential for effective verification. To assess whether these findings generalize beyond a single backbone, we repeat the experiment with Gemma-3-4B [[46](https://arxiv.org/html/2605.12620#bib.bib65)], fine-tuning both the CoT policy and the verifier. We observe the same trends: the zero-shot verifier provides no meaningful gains, while VeGAS improves the average success rate to 51%. This consistency across two different model families demonstrates that the benefits of VeGAS are not architecture-specific but arise from the finetuned verifier itself. Taken together, our results on LangR and EB-ALFRED demonstrate that verifying actions at test time can substantially enhance the generalization capabilities of embodied agents in challenging scenarios. Figure [5](https://arxiv.org/html/2605.12620#S6.F5) shows an example where the verifier reliably detects a subtle error made by the policy.

![Figure 4, left panel](https://arxiv.org/html/2605.12620v1/images/hab_qual_t=0.png)![Figure 4, right panel](https://arxiv.org/html/2605.12620v1/images/hab_qual_t=1.png)

**Instruction:** Acquire a *sports object* and position it in the brown table.

**Sampled Action:** `<task>Acquire a sports object and position it on the brown table; remaining: locate the object, pick it up, navigate to the brown table, and place it.</task> <plan>[search sink, search Table1, search TV stand, search sofa, pick object, navigate to brown table, place object]</plan> <subtask_reason>Having arrived at the sink, I must check if the object is here and grasp it if present; hence I need to do pick_sponge(robot_0).</subtask_reason> <subtask>attempt pick at sink</subtask> <action>('pick_sponge(robot_0)', 28)</action>`

**Verification CoT:** The instruction requires picking up a *SPORTS object* (likely a ball). The chosen action is `pick_sponge`, which grasps a sponge instead. It *does not move the correct object* toward completion of the task and therefore fails to advance the goal. `action_is_correct: no`

Figure 4: Verification example on the LangR benchmark. Here, the task is to move a "sports object", but the agent proposes 'pick_sponge'. The verifier correctly identifies that the sponge is not a sports object, classifying the action as incorrect.

Table 2: Success rates on EB-ALFRED [[50](https://arxiv.org/html/2605.12620#bib.bib10)]. ZS and FT refer to zero-shot and finetuning, respectively. The CoT policy employs greedy decoding, producing deterministic outcomes. Results involving verifiers are averaged over three runs. Our finetuned verifier consistently outperforms the CoT policy as well as the zero-shot verifier. Further, our finetuned verifier improves the performance of larger policies, showing cross-model generalization.

![Figure 5, left panel](https://arxiv.org/html/2605.12620v1/images/alf_qual_1.png)![Figure 5, right panel](https://arxiv.org/html/2605.12620v1/images/alf_qual_2.png)

**Instruction:** Move the book from the table to the *large* comfortable lounging spot.

**Sampled Action:** `<task>Book is in hand; it still must be placed on the large comfortable lounge chair.</task> <plan>[1) Find/navigate to the lounge chair, 2) Put the book down]</plan> <subtask_reason>I cannot place the book until I have reached the correct furniture item. Its exact location relative to the current viewpoint is not confirmed, so I should perform another search targeted at an ArmChair, hence I should choose the action that finds an ArmChair.</subtask_reason> <subtask>Find the lounge chair</subtask> <action>find a ArmChair</action>`

**Verification CoT:** The instruction specifies the *large* comfortable lounging spot, which in the environment corresponds to the `Sofa` object. Searching for an `ArmChair` indicates the agent has *mis-identified the target receptacle*. While the action executes successfully, it does not advance the true goal. `action_is_correct: no`

Figure 5: Verification example on the ALFRED benchmark. The task is to move the book to a "large comfortable lounging spot", but the agent makes a subtle mistake and proposes a related receptacle, the armchair. The verifier correctly identifies that the correct receptacle is the Sofa, classifying the action as incorrect.

**Verifier-Guided Improvement of Large Policies.** We evaluate whether a small, finetuned verifier can improve large policies it was never trained with, a practical setting where large models are inaccessible for fine-tuning. We pair our Qwen2.5-VL-3B-Instruct verifier with several zero-shot policies on EB-ALFRED (Table [2](https://arxiv.org/html/2605.12620#S6.T2)). Our verifier consistently improves every policy it is paired with; most notably, it improves Qwen2.5-VL-72B, a model ∼20× its own size, from 30% to 38%, demonstrating that a compact verifier can meaningfully enhance policies far beyond its own scale.

### 6.2 Ablation Studies

**Impact of Scaling Candidate Actions.** Increasing the number of sampled candidate actions increases the likelihood of including at least one correct action [[4](https://arxiv.org/html/2605.12620#bib.bib53)], but also increases inference cost. To isolate whether gains come from the verifier or from sampling diversity alone, we compare against self-consistency [[47](https://arxiv.org/html/2605.12620#bib.bib56)], which samples multiple actions and selects by majority vote. To ensure a fair comparison, we match total LLM calls across both methods: following Singhi et al. [[37](https://arxiv.org/html/2605.12620#bib.bib31)], if VeGAS samples $N$ actions with $M$ verifications each, self-consistency samples $N(M+1)$ actions. (The total number of LLM calls for VeGAS is $N + N \times M = N(M+1)$: $N$ policy calls plus $N \times M$ verification calls.) As shown in Figure [6](https://arxiv.org/html/2605.12620#S6.F6), on EB-ALFRED, self-consistency improves with additional compute but does not scale as efficiently as VeGAS, which exhibits steeper and more consistent gains. These results underscore that a finetuned verifier is crucial for effectively leveraging additional test-time compute to improve embodied agent performance.
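A sketch of the compute-matched baseline, with the same hypothetical `policy.sample` interface as before:

```python
from collections import Counter

def self_consistency_action(policy, instruction, history, obs, N=16, M=5):
    """VeGAS spends N + N*M = N(M+1) LLM calls, so self-consistency gets the
    same budget as N*(M+1) sampled actions, selected by majority vote."""
    budget = N * (M + 1)                        # e.g. 16 * (5 + 1) = 96 calls
    actions = [policy.sample(instruction, history, obs, temperature=0.7)[1]
               for _ in range(budget)]          # keep only the action string
    return Counter(actions).most_common(1)[0][0]
```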

![Figure 6](https://arxiv.org/html/2605.12620v1/x4.png)

Figure 6: Scaling candidate actions on EB-ALFRED. Average success rate as the number of candidate actions $N$ increases. Both methods use the same total number of LLM calls. VeGAS scales better with compute than Self-Consistency.

**Sampled Candidates Reliably Contain Correct Actions.** Best-of-N selection can only succeed if the candidate set contains at least one correct action. We therefore ask: how often does at least one correct action appear among $N$ sampled candidates? Since no ground-truth oracle exists for action correctness, we use o3 as a judge. We measure this coverage probability on LangR as a function of $N$. As shown in Table [3](https://arxiv.org/html/2605.12620#S6.T3), coverage rises sharply with $N$: with just 10 candidates, at least one correct action is present in 89% of cases. This confirms that the policy's candidate set is highly likely to contain a correct action, making Best-of-N an effective strategy.
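For intuition, an idealized model that treats the $N$ candidates as i.i.d. with a fixed per-sample correctness probability $p$ (an assumption of ours; real samples from one policy are correlated) relates coverage to $N$ as follows; the implied $p$ is a back-of-envelope estimate, not a number reported in the paper.

```latex
\mathrm{coverage}(N) = 1 - (1 - p)^{N}
% With the reported coverage(10) = 0.89:
%   p = 1 - (1 - 0.89)^{1/10} \approx 0.20,
% i.e. even a ~20% per-sample hit rate yields high coverage at moderate N.
```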

Table 3: Candidate set coverage on LangR. Probability that at least one correct action appears among $N$ candidates.

**Teacher Model Sensitivity.** We investigate whether replacing o3 with the cheaper Qwen-3-VL-8B-thinking as the teacher degrades verifier quality. As shown in Table [4](https://arxiv.org/html/2605.12620#S6.T4), the weaker-teacher verifier achieves 69%, compared to 71% with o3-generated data and 65% for the CoT baseline. While o3 yields the best performance, a much cheaper teacher still provides meaningful gains, making our pipeline accessible without requiring expensive frontier models.

Table 4: Teacher model sensitivity on LangR. Average success rate when using Qwen-3-VL-8B-thinking vs. o3 to generate synthetic verifier training data.

**Latency.** A natural concern with VeGAS is inference latency: sampling $N$ candidate actions and $M$ verifications per action requires $N(M+1)$ total LLM calls. However, since all candidates and verifications can be sampled in parallel, the wall-clock overhead is far more modest than the raw call count suggests. Table [5](https://arxiv.org/html/2605.12620#S6.T5) reports latency as we scale $N$ at $M=5$. Going from $N=1$ (a single greedy action, 1 LLM call) to $N=8$ (48 LLM calls, 48× more) increases latency by only 2× (3s → 6s). This demonstrates that parallel sampling makes VeGAS practical for deployment even at larger compute budgets.

Table 5: Latency vs. compute budget. Wall-clock time to sample all actions and verifications for increasing $N$ (candidate actions).

**Impact of Visual Input to the Verifier.** We ask: if we train a text-only verifier (i.e., one receiving only the action and its reasoning CoT, without the egocentric image), how much does performance degrade? On LangR, the text-only verifier achieves the same average success rate as the multimodal verifier (71% vs. 71%). On EB-ALFRED, we observe only a marginal drop (49% vs. 47.5%). We note that the text-only verifier is not truly "blind": the chain-of-thought reasoning trace accompanying each candidate action describes the visual scene in natural language, which may explain why removing the image input does not meaningfully hurt performance. This is consistent with prior work showing that text-based scene descriptions are sufficient for high-level embodied planning [[14](https://arxiv.org/html/2605.12620#bib.bib1), [50](https://arxiv.org/html/2605.12620#bib.bib10)]. We also speculate that current high-level benchmarks lack complex scenarios with occlusions or fine-grained visual distinctions, where explicit vision-based verification would be most beneficial.

## 7 Conclusion

We introduced Verifier-Guided Action Selection (VeGAS), a test-time framework that improves the out-of-distribution robustness of embodied agents via an explicit verification step. Using an automated pipeline to synthesize failure trajectories for verifier training, VeGAS achieves consistent gains on LangR and EB-ALFRED, including over significantly larger off-the-shelf policies. Our analyses show that verifier finetuning is essential for reliably leveraging additional test-time compute.

## Acknowledgements

Nishad Singhi is supported by a LOEWE Start-Professur (LOEWE/4b//519/05.01.002-(0006)/94). Marcus Rohrbach is supported in part by an Alexander von Humboldt Professorship in Multimodal Reliable AI, sponsored by Germany's Federal Ministry for Education and Research. For compute, we gratefully acknowledge support from the hessian.AI Service Center (funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant no. 16IS22091) and the hessian.AI Innovation Lab (funded by the Hessian Ministry for Digital Strategy and Innovation, grant no. S-DIW04/0013/003). The work has benefited from the Excellence Cluster "Reasonable AI" by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC-3057), the DFG Emmy Noether Programme (CH 2676/1-1), the European Union's Horizon Europe projects "MANiBOT" (Grant No. 101120823) and "ARISE" (Grant No. 101135959), and the BMFTR project "RIG" (Grant No. 16ME1001).

## References

- [1] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al. (2022) Do as I can, not as I say: grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
- [2] (2024) Critique-out-loud reward models. arXiv preprint arXiv:2408.11791.
- [3] S. Bai, S. Yang, S. Wang, C. Sun, Z. Gong, Y. Yang, Y. Qian, X. Ren, Z. Wei, Z. Su, et al. (2025) Qwen2.5-VL: a state-of-the-art vision-language model series. arXiv preprint arXiv:2502.13923.
- [4] B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini (2024) Large language monkeys: scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787.
- [5] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [6] M. Deitke, W. Han, A. Herrasti, A. Kembhavi, E. Kolve, R. Mottaghi, J. Salvador, D. Schwenk, E. VanderBilt, M. Wallingford, L. Weihs, M. Yatskar, and A. Farhadi (2020) RoboTHOR: an open simulation-to-real embodied AI platform. In CVPR.
- [7] M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, J. Salvador, K. Ehsani, W. Han, E. Kolve, A. Farhadi, A. Kembhavi, and R. Mottaghi (2022) ProcTHOR: large-scale embodied AI using procedural generation. In NeurIPS. Outstanding Paper Award.
- [8] Y. Ding, X. Zhang, C. Paxton, and S. Zhang (2023) Task and motion planning with large language models for object rearrangement. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- [9] V. S. Dorbala, J. F. Mullen, and D. Manocha (2023) Can an embodied agent find your "cat-shaped mug"? LLM-based zero-shot object navigation. IEEE Robotics and Automation Letters.
- [10] M. Du, B. Wu, Z. Li, X. Huang, and Z. Wei (2024) EmbSpatial-Bench: benchmarking spatial understanding for embodied tasks with large vision-language models. In Annual Meeting of the Association for Computational Linguistics (Short Papers).
- [11] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas (2023) Guiding pretraining in reinforcement learning with large language models. In International Conference on Machine Learning (ICML).
- [12] K. Ehsani, W. Han, A. Herrasti, E. VanderBilt, L. Weihs, E. Kolve, A. Kembhavi, and R. Mottaghi (2021) ManipulaTHOR: a framework for visual object manipulation. In CVPR.
- [13] Y. Hong, H. Huang, M. Li, L. F. Li, J. Wu, and Y. Choi (2026) Learning from trials and errors: reflective test-time planning for embodied LLMs. arXiv preprint [arXiv:2602.21198](http://arxiv.org/abs/2602.21198).
- [14] J. Huang, A. Sethi, M. Kuo, M. Keoliya, N. Velingker, J. Jung, S. Lim, Z. Li, and M. Naik (2025) ESCA: contextualizing embodied agents via scene-graph generation. arXiv preprint arXiv:2510.15963.
- [15] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch (2022) Language models as zero-shot planners: extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pp. 9118–9147.
- [16] Y. Kant, A. Ramachandran, S. Yenamandra, I. Gilitschenski, D. Batra, A. Szot, and H. Agrawal (2022) Housekeep: tidying virtual households using commonsense reasoning. In European Conference on Computer Vision, pp. 355–373.
- [17] E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, M. Deitke, K. Ehsani, D. Gordon, Y. Zhu, et al. (2017) AI2-THOR: an interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474.
- [18] J. Kwok, C. Agia, R. Sinha, M. Foutter, S. Li, I. Stoica, A. Mirhoseini, and M. Pavone (2025) RoboMonkey: scaling test-time sampling and verification for vision-language-action models. arXiv preprint arXiv:2506.17811.
- [19] C. Li, R. Zhang, J. Wong, C. Gokmen, S. Srivastava, R. Martín-Martín, C. Wang, G. Levine, M. Lingelbach, J. Sun, et al. (2023) BEHAVIOR-1K: a benchmark for embodied AI with 1,000 everyday activities and realistic simulation. In Conference on Robot Learning, pp. 80–93.
- [20] M. Li, S. Zhao, Q. Wang, K. Wang, Y. Zhou, S. Srivastava, C. Gokmen, T. Lee, E. L. Li, R. Zhang, et al. (2024) Embodied agent interface: benchmarking LLMs for embodied decision making. Advances in Neural Information Processing Systems (NeurIPS).
- [21] S. Li, X. Puig, C. Paxton, Y. Du, C. Wang, L. Fan, T. Chen, D. Huang, E. Akyürek, A. Anandkumar, et al. (2022) Pre-trained language models for interactive decision-making. Advances in Neural Information Processing Systems 35, pp. 31199–31212.
- [22] W. Li, Z. Yu, Q. She, Z. Yu, Y. Lan, C. Zhu, R. Hu, and K. Xu (2024) LLM-enhanced scene graph learning for household rearrangement. In SIGGRAPH Asia 2024.
- [23] Z. Li, E. Jin, X. Li, H. Luo, M. Zhang, M. Shoeybi, T. Du, S. Wang, J. Sun, S. Yan, Z. Yu, X. He, Y. Wang, M. Wiseman, A. Ahmed, M. Li, C. Zhang, J. E. Gonzalez, I. Stoica, and T. Zhao (2023) vLLM: easy, fast, and cheap LLM inference. In Proceedings of Neural Information Processing Systems (NeurIPS) Demo Track.
- [24] B. Lin, Y. Nie, Z. Wei, J. Chen, S. Ma, J. Han, H. Xu, X. Chang, and X. Liang (2025) NavCoT: boosting LLM-based vision-and-language navigation via learning disentangled reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- [25] J. Liu, P. Zhou, Y. Du, A. Tan, C. G. Snoek, J. Sonke, and E. Gavves (2024) CaPo: cooperative plan optimization for efficient embodied multi-agent cooperation. arXiv preprint arXiv:2411.04679.
- [26] G. Lu, Z. Wang, C. Liu, J. Lu, and Y. Tang (2025) ThinkBot: embodied instruction following with thought chain reasoning. In International Conference on Learning Representations (ICLR).
- [27] M. Savva*, A. Kadian*, O. Maksymets*, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, D. Parikh, and D. Batra (2019) Habitat: a platform for embodied AI research. In IEEE/CVF International Conference on Computer Vision (ICCV).
- [28] Y. Mu, Q. Zhang, M. Hu, W. Wang, M. Ding, J. Jin, B. Wang, J. Dai, Y. Qiao, and P. Luo (2023) EmbodiedGPT: vision-language pre-training via embodied chain of thought. Advances in Neural Information Processing Systems (NeurIPS).
- [29] A. Padmakumar, J. Thomason, A. Shrivastava, P. Lange, A. Narayan-Chen, S. Gella, R. Piramuthu, G. Tur, and D. Hakkani-Tur (2022) TEACh: task-driven embodied agents that chat. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 2017–2025.
- [30] X. Puig, E. Undersander, A. Szot, M. D. Cote, T. Yang, R. Partsey, R. Desai, A. Clegg, M. Hlavac, S. Y. Min, et al. (2024) Habitat 3.0: a co-habitat for humans, avatars, and robots. In International Conference on Learning Representations (ICLR).
- [31] Y. Qiao, W. Lyu, H. Wang, Z. Wang, Z. Li, Y. Zhang, M. Tan, and Q. Wu (2025) Open-Nav: exploring zero-shot vision-and-language navigation in continuous environment with open-source LLMs. In IEEE International Conference on Robotics and Automation (ICRA).
- [32] G. Sarch, Y. Wu, M. Tarr, and K. Fragkiadaki (2023) Open-ended instructable embodied agents with memory-augmented large language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- [33] R. Schumann, W. Zhu, W. Feng, T. Fu, S. Riezler, and W. Y. Wang (2024) VELMA: verbalization embodiment of LLM agents for vision and language navigation in street view. In AAAI Conference on Artificial Intelligence.
- [34] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox (2020) ALFRED: a benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10740–10749.
- [35] M. Shridhar, X. Yuan, M. Cote, Y. Bisk, A. Trischler, and M. Hausknecht (2021) ALFWorld: aligning text and embodied environments for interactive learning. In International Conference on Learning Representations (ICLR).
- [36] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg (2023) ProgPrompt: generating situated robot task plans using large language models. In IEEE International Conference on Robotics and Automation (ICRA).
- [37] N. Singhi, H. Bansal, A. Hosseini, A. Grover, K. Chang, M. Rohrbach, and A. Rohrbach (2025) When to solve, when to verify: compute-optimal problem solving and generative verification for LLM reasoning. In Conference on Language Modeling (COLM).
- [38] C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
- [39] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W. Chao, and Y. Su (2023) LLM-Planner: few-shot grounded planning for embodied agents with large language models. In IEEE/CVF International Conference on Computer Vision (ICCV).
- [40] H. Sun, Y. Zhuang, L. Kong, B. Dai, and C. Zhang (2023) AdaPlanner: adaptive planning from feedback with language models. Advances in Neural Information Processing Systems (NeurIPS).
- \[41\]L\. Sun, H\. Liang, J\. Wei, B\. Yu, T\. Li, F\. Yang, Z\. Zhou, and W\. Zhang\(2025\)Mm\-verify: enhancing multimodal reasoning with chain\-of\-thought verification\.arXiv preprint arXiv:2502\.13383\.Cited by:[§2](https://arxiv.org/html/2605.12620#S2.p2.1)\.
- \[42\]Q\. Sun, P\. Hong, T\. D\. Pala, V\. Toh, U\. Tan, D\. Ghosal, and S\. Poria\(2025\)Emma\-x: an embodied multimodal action model with grounded chain of thought and look\-ahead spatial reasoning\.InAnnual Meeting of the Association for Computational Linguistics \(ACL\),Cited by:[§2](https://arxiv.org/html/2605.12620#S2.p1.1)\.
- \[43\]A\. Szot, A\. Clegg, E\. Undersander, E\. Wijmans, Y\. Zhao, J\. Turner, N\. Maestre, M\. Mukadam, D\. Chaplot, O\. Maksymets, A\. Gokaslan, V\. Vondrus, S\. Dharur, F\. Meier, W\. Galuba, A\. Chang, Z\. Kira, V\. Koltun, J\. Malik, M\. Savva, and D\. Batra\(2021\)Habitat 2\.0: training home assistants to rearrange their habitat\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[item 3](https://arxiv.org/html/2605.12620#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.12620#S1.p4.1),[§2](https://arxiv.org/html/2605.12620#S2.p3.1),[§5\.1](https://arxiv.org/html/2605.12620#S5.SS1.p1.1),[§8\.1](https://arxiv.org/html/2605.12620#S8.SS1.p1.1)\.
- \[44\]A\. Szot, B\. Mazoure, H\. Agrawal, R\. D\. Hjelm, Z\. Kira, and A\. Toshev\(2024\)Grounding multimodal large language models in actions\.Advances in Neural Information Processing Systems37,pp\. 20198–20224\.Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p1.1),[§3](https://arxiv.org/html/2605.12620#S3.p3.7),[§6\.1](https://arxiv.org/html/2605.12620#S6.SS1.p1.1),[Table 1](https://arxiv.org/html/2605.12620#S6.T1.2.1.5.5.1.1)\.
- \[45\]A\. Szot, M\. Schwarzer, H\. Agrawal, B\. Mazoure, R\. Metcalf, W\. Talbott, N\. Mackraz, R\. D\. Hjelm, and A\. T\. Toshev\(2023\)Large language models as generalizable policies for embodied tasks\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[item 3](https://arxiv.org/html/2605.12620#S1.I1.i3.p1.1),[§1](https://arxiv.org/html/2605.12620#S1.p1.1),[§1](https://arxiv.org/html/2605.12620#S1.p4.1),[§2](https://arxiv.org/html/2605.12620#S2.p1.1),[§2](https://arxiv.org/html/2605.12620#S2.p3.1),[§3](https://arxiv.org/html/2605.12620#S3.p2.12),[§3](https://arxiv.org/html/2605.12620#S3.p3.7),[§5\.1](https://arxiv.org/html/2605.12620#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.12620#S5.SS1.p2.2.1),[§5\.2](https://arxiv.org/html/2605.12620#S5.SS2.p1.3),[Table 1](https://arxiv.org/html/2605.12620#S6.T1.2.1.4.4.1.1),[Table 1](https://arxiv.org/html/2605.12620#S6.T1.3.1),[Table 1](https://arxiv.org/html/2605.12620#S6.T1.7.2),[§8\.1](https://arxiv.org/html/2605.12620#S8.SS1.p1.1)\.
- \[46\]G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, Dmitry, Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot\(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§6\.1](https://arxiv.org/html/2605.12620#S6.SS1.p3.2)\.
- \[47\]X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou\(2022\)Self\-consistency improves chain of thought reasoning in language models\.arXiv preprint arXiv:2203\.11171\.Cited by:[item 3](https://arxiv.org/html/2605.12620#S1.I1.i3.p1.1),[§6\.2](https://arxiv.org/html/2605.12620#S6.SS2.p1.3)\.
- \[48\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in Neural Information Processing Systems \(NeurIPS\)\.Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p1.1),[§2](https://arxiv.org/html/2605.12620#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.12620#S4.SS1.p1.13)\.
- \[49\]X\. Yan, Y\. Song, X\. Feng, M\. Yang, H\. Zhang, H\. B\. Ammar, and J\. Wang\(2025\)Efficient reinforcement learning with large language model priors\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2605.12620#S2.p1.1)\.
- \[50\]R\. Yang, H\. Chen, J\. Zhang, M\. Zhao, C\. Qian, K\. Wang, Q\. Wang, T\. V\. Koripella, M\. Movahedi, M\. Li,et al\.\(2025\)EmbodiedBench: comprehensive benchmarking multi\-modal large language models for vision\-driven embodied agents\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p1.1),[§2](https://arxiv.org/html/2605.12620#S2.p1.1),[§2](https://arxiv.org/html/2605.12620#S2.p3.1),[§3](https://arxiv.org/html/2605.12620#S3.p2.12),[§3](https://arxiv.org/html/2605.12620#S3.p3.7),[§5\.1](https://arxiv.org/html/2605.12620#S5.SS1.p3.1),[§6\.1](https://arxiv.org/html/2605.12620#S6.SS1.p3.2),[§6\.2](https://arxiv.org/html/2605.12620#S6.SS2.p5.1),[Table 2](https://arxiv.org/html/2605.12620#S6.T2.3.1),[Table 2](https://arxiv.org/html/2605.12620#S6.T2.5.2),[§8\.2](https://arxiv.org/html/2605.12620#S8.SS2.p1.1),[§8\.2](https://arxiv.org/html/2605.12620#S8.SS2.p4.1)\.
- \[51\]Y\. Yang, T\. Zhou, K\. Li, D\. Tao, L\. Li, L\. Shen, X\. He, J\. Jiang, and Y\. Shi\(2024\)Embodied multi\-modal agent trained by an llm from a parallel textworld\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 26275–26285\.Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p1.1)\.
- \[52\]F\. Yu, A\. Gao, and B\. Wang\(2024\)Ovm, outcome\-supervised value models for planning in mathematical reasoning\.InFindings of the Association for Computational Linguistics: NAACL 2024,pp\. 858–875\.Cited by:[§2](https://arxiv.org/html/2605.12620#S2.p2.1),[§4](https://arxiv.org/html/2605.12620#S4.p1.1)\.
- \[53\]M\. Zawalski, W\. Chen, K\. Pertsch, O\. Mees, C\. Finn, and S\. Levine\(2024\)Robotic control via embodied chain\-of\-thought reasoning\.In8th Annual Conference on Robot Learning,Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p1.1),[§10\.1](https://arxiv.org/html/2605.12620#S10.SS1.p1.pic1.2.2.2.1.1.1.1),[§2](https://arxiv.org/html/2605.12620#S2.p1.1),[§4\.1](https://arxiv.org/html/2605.12620#S4.SS1.p1.13)\.
- \[54\]S\. Zhai, H\. Bai, Z\. Lin, J\. Pan, P\. Tong, Y\. Zhou, A\. Suhr, S\. Xie, Y\. LeCun, Y\. Ma,et al\.\(2024\)Fine\-tuning large vision\-language models as decision\-making agents via reinforcement learning\.Advances in neural information processing systems37,pp\. 110935–110971\.Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p1.1)\.
- \[55\]L\. Zhang, A\. Hosseini, H\. Bansal, M\. Kazemi, A\. Kumar, and R\. Agarwal\(2025\)Generative verifiers: reward modeling as next\-token prediction\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§1](https://arxiv.org/html/2605.12620#S1.p3.1),[§2](https://arxiv.org/html/2605.12620#S2.p2.1),[§4\.2](https://arxiv.org/html/2605.12620#S4.SS2.p2.11),[§4](https://arxiv.org/html/2605.12620#S4.p1.1)\.
- \[56\]Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, Z\. Luo, Z\. Feng, and Y\. Ma\(2024\)LlamaFactory: unified efficient fine\-tuning of 100\+ language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),Bangkok, Thailand\.External Links:[Link](http://arxiv.org/abs/2403.13372)Cited by:[§9](https://arxiv.org/html/2605.12620#S9.p1.1)\.
- \[57\]L\. Zu, L\. Lin, S\. Fu, N\. Zhao, and P\. Zhou\(2025\)Collaborative tree search for enhancing embodied multi\-agent collaboration\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 29513–29522\.Cited by:[§2](https://arxiv.org/html/2605.12620#S2.p1.1)\.

# Supplementary Material

First, we provide additional details about the benchmarks (App. 8) and the training setup (App. 9). Then, we provide the prompts used for synthetic data generation in App. 10. Further, qualitative examples of synthetic mistakes and verifications are available in App. 11. Finally, we provide qualitative examples of the outputs generated by our verifiers at test time in App. 12.

## 8 Details about Benchmarks

### 8.1 LangR

The LangR benchmark [45], built on the Habitat 2.0 simulator [43], is designed to evaluate the generalization capability of embodied agents in household rearrangement scenarios. Agents receive high-level instructions and must execute tasks that involve manipulating objects (pick, open, place), searching for target items, and performing simple forms of logical reasoning such as conditional operations. The allowed actions are: navigate(receptacle), pick(object), place(object), open(receptacle), close(receptacle).

The actions are executed only if certain preconditions are satisfied. For example, an object can be picked only if it is within reach. The object categories used are: ball, clamp, hammer, screwdriver, padlock, scissors, block, drill, spatula, knife, spoon, plate, sponge, cleanser, plum, pear, peach, apple, lemon, can, box, banana, strawberry, lego, Rubik's cube, book, bowl, cup, fork. The maximum number of steps for an episode is 32.
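
For intuition, a precondition can be pictured as a guard in front of each primitive. The following minimal Python sketch is our illustration, not the benchmark's implementation; `WorldState` and `try_pick` are hypothetical names. It shows a pick action that succeeds only when the object is within reach and the gripper is free.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorldState:
    reachable: set = field(default_factory=set)   # objects currently within reach
    holding: Optional[str] = None                 # object held by the gripper, if any

def try_pick(state: WorldState, obj: str) -> bool:
    """Execute pick(object) only if its preconditions hold; otherwise a no-op."""
    if state.holding is not None:       # precondition: gripper must be free
        return False
    if obj not in state.reachable:      # precondition: object must be within reach
        return False
    state.holding = obj
    return True

state = WorldState(reachable={"apple"})
assert not try_pick(state, "banana")   # fails: banana is not within reach
assert try_pick(state, "apple")        # succeeds and updates the gripper state
```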

The benchmark provides a suite of training tasks along with a separate set of held-out test tasks. The test split is constructed to examine multiple aspects of generalization, including previously unseen environments and novel instruction formulations.

Two main forms of generalization are emphasized. The first is paraphrastic robustness, where the agent must interpret varied rephrasings of an instruction that share the same underlying goal. The second is behavioral generalization, which requires the agent to handle new types of reasoning that do not appear in the training distribution. For example, during training the agent may learn to locate a specified number of object instances, while the multiple objects task in the test set requires discovering all instances of a category without being told the exact count. We describe the tasks below.

Paraphrastic Robustness

- Instruction Rephrasing: Same underlying goal expressed using different wording than in training.
- Referring Expressions: Objects are mentioned through descriptive or visual attributes rather than their canonical names (e.g., a banana described as a curved yellow fruit).
- Context: Objects are referred to within a situational or contextual description (e.g., a ball described as a sports object).
- Irrelevant Instruction Text: Additional text is included that does not affect the task but may distract the agent.

Behavioral Generalization

- Multiple Rearrangements: Requires rearranging three objects, although training tasks involve only two.
- Novel Objects: Introduces new combinations of instructions and object categories that never co-occur in training.
- Multiple Objects: Requires manipulating all instances of an object category. The agent must search for and detect every instance, a concept not present in training.
- Conditional Instructions: The task outcome depends on whether a specified condition holds (e.g., if the fridge is open, move the apple into it; otherwise move the orange; in either case, only the required object should be moved).

The benchmark also includes a spatial reasoning task that uses instructions such as "place the object to the right of the black table." The action space available to the agent does not provide primitives for lateral (left, right, forward, back) movement, which prevents the agent from acquiring meaningful knowledge of the scene layout. As a result, the task is not compatible with the defined action space, so we exclude it from our evaluation.

### 8.2 EB-ALFRED

The ALFRED [34] benchmark is built on top of the AI2-THOR simulator [17]. In this work, we use the EB-ALFRED implementation from [50], which restructures the original tasks into categories designed to probe different aspects of out-of-distribution generalization. The benchmark includes seven task types: *Pick & Place*, *Stack & Place*, *Pick Two & Place*, *Clean & Place*, *Heat & Place*, *Cool & Place*, and *Examine in Light*. Agents operate using eight possible actions: pick up, open, close, turn on, turn off, slice, put down, and find.

EB-ALFRED tasks are grouped into six subsets, each targeting a distinct skill or reasoning capability:

- Base: Evaluates core task-solving abilities needed to plan and execute low-to-medium-complexity action sequences.
- Common Sense: Measures the use of indirect object references grounded in everyday knowledge (for example, describing a refrigerator as "a receptacle that can keep food fresh for several days") and tests the agent's ability to apply such knowledge during instruction following.
- Complex Instruction: Contains longer contexts with both relevant and irrelevant details, assessing an agent's ability to extract the intended instruction.
- Spatial Awareness: Refers to objects through spatial relations with other items, testing spatial grounding and relational reasoning.
- Visual Appearance: Requires identifying objects based on visual attributes such as color or shape.
- Long Horizon: Includes tasks requiring extended action sequences, typically more than 15 steps in EB-ALFRED.

To construct the benchmark, [50] used the *valid seen* split of ALFRED. A set of 50 tasks with fewer than 15 steps was first selected, from which the common sense and complex instruction subsets were derived. Another 50 tasks with more than 15 steps formed the long horizon subset. Instances for the visual appearance and spatial awareness subsets were chosen directly from ALFRED based on language references to color, shape, or spatial relations. In total, EB-ALFRED contains 300 test instances, uniformly distributed across the six subsets (50 per subset).

## 9 Training Details

We train both the policy and the verifier using the LLaMAFactory framework [56]. Full finetuning is applied while keeping the vision encoder and projection module fixed. All training runs are conducted on 8× NVIDIA L40 GPUs. Training data is formatted as multi-turn dialogues using the sharegpt format provided by LLaMAFactory.
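
For concreteness, a single training record in this format might look like the sketch below. The field layout (a `conversations` list of `from`/`value` turns plus an `images` list matching the `<image>` tags) follows the sharegpt convention used by LLaMAFactory; the instruction and action strings are invented for illustration.

```python
import json

# One sub-conversation for a single action step: earlier assistant turns keep
# only the executed action (their chains of thought are stripped, as described
# below), and the final assistant turn carries the full CoT plus action.
record = {
    "conversations": [
        {"from": "human", "value": "Instruction: move the apple to the fridge.\n<image>"},
        {"from": "gpt", "value": "navigate(fridge)"},
        {"from": "human", "value": "<image>"},
        {"from": "gpt", "value": "<task>...</task><plan>...</plan><action>open(fridge)</action>"},
    ],
    "images": ["obs_step_0.png", "obs_step_1.png"],
}
print(json.dumps(record, indent=2))
```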

The inputs to the policy and verifier consist of prior images and actions rather than earlier chains of thought. To generate data compatible with this interface, each trajectory from the dataset is decomposed into a set of sub-conversations, one for each action step. Consider an original trajectory of the form

$$I,\; o_1,\; (c_1, a_1),\; o_2,\; (c_2, a_2),\; \dots$$

where $I$ is the instruction, $o_i$ are observations, $c_i$ are chains of thought, and $a_i$ are actions. The $i$-th sub-conversation contains the instruction along with all observations and executed actions up to step $i$:

$$I,\; o_1,\; a_1,\; o_2,\; a_2,\; \dots,\; o_i,\; (c_i, a_i).$$

All chains of thought except the final one, $c_i$, are removed. During training, we compute the loss only on the last assistant message of each sub-conversation, implemented by setting `mask_history` to `True`. Hyperparameters are provided in Table 6.
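
A minimal sketch of this decomposition, under the assumption that each step is stored as an (observation, chain-of-thought, action) triple; the helper name `to_subconversations` is ours, not code from the paper:

```python
def to_subconversations(instruction, steps):
    """steps: list of (observation, cot, action) triples, one per time step."""
    subconvs = []
    for i in range(len(steps)):
        turns = []
        for t, (obs, _, action) in enumerate(steps[: i + 1]):
            # the instruction is prepended to the very first observation
            user = f"{instruction}\n{obs}" if t == 0 else obs
            turns.append({"from": "human", "value": user})
            if t < i:
                # history keeps executed actions but drops their chains of thought
                turns.append({"from": "gpt", "value": action})
        cot_i, action_i = steps[i][1], steps[i][2]
        # only the final assistant message keeps its CoT; with mask_history=True,
        # the training loss is computed on this message alone
        turns.append({"from": "gpt", "value": f"{cot_i}\n{action_i}"})
        subconvs.append({"conversations": turns})
    return subconvs
```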

Table 6: Hyperparameters used for training.

## 10 Prompts

### 10.1 Prompt for CoT data generation

Prompt to Generate Synthetic Chain-of-Thought (adapted and modified from [53]):

You're an expert reinforcement learning researcher. You've trained a policy for controlling a robot with an arm to move around and perform tasks in a household environment.
The robot successfully completed a task specified by the instruction: "{instruction}". For that purpose, the
robot executed a sequence of actions. Consecutive moves that were executed are the following:

insert original trajectory with instruction, images, actions


The attached images show what the robot was seeing at a given time step (corresponding to every <image> tag in the trajectory). You can use the images to understand what the robot was seeing, where it was, what it was holding, whether its actions were successful or not, etc.


## General Information and Instructions

1. Possible actions are: pick(object), place_on_recep(receptacle), navigate(receptacle), open_fridge(), close_fridge(), open_cabinet(cabinet).
2. Possible objects are: ball, clamp, hammer, screwdriver, padlock, scissors, block, drill, spatula, knife, spoon, plate, sponge, cleanser, plum, pear, peach, apple, lemon, can, box, banana, strawberry, lego, Rubik's cube, book, bowl, cup, mug, orange, lid, toy airplane, wrench.
3. When exploring, select from possible receptacles: [cabinet, drawer 7, cabinet drawer 6, fridge, chair, black table, brown table, TV stand, sink, right counter, left counter]

## Your objective

I want you to annotate the given trajectory with reasoning. That is, for each step, I need to know not only which action should be chosen,
but importantly what reasoning justifies that action choice. I want you to be descriptive and include all the relevant information available.
The reasoning should include the task to complete, the remaining high-level steps, the high-level movements that should be executed and why they are required and any other relevant justification.


### Begin by describing the task

Start by giving an overview of the task. Make it more comprehensive than the simple instruction. Include the activity, 
the objects the robot interacts with, and their relative locations in the environment. Then, describe 
the high-level movements that were most likely executed, based on the task that was completed and the 
primitive movements that were executed. Also, for each high-level movement write a justification for why it should be executed. 
Write an answer for this part using markdown and natural language. Be descriptive and highlight all the relevant details, but ensure that your description is consistent with the trajectory that was  executed, specified above.

### List the reasonings for each step

Finally, for each step describe the reasoning that allows one to determine the correct action. For each step describe the remaining part of the objective, 
the current progress, the objects that are still relevant for determining the plan, and the plan for the next steps, based on the available features. Start the reasoning from a high level and gradually add finer features. I need you to be descriptive and very precise. Ensure that the reasoning is consistent with the task and the executed trajectory. 
Write the answer for this part as a Python-executable dictionary. For every step in the initial trajectory there should be exactly one separate item of the form <step id>:<reasoning>. 
Do not group the answers. The final dictionary should have exactly the same set of integer keys as the assistant messages in the original_trajectory above.
The reasoning should be a single string that describes the reasoning in natural language and includes all the required features.

Each reasoning string should have the following form:
- Describe the full task that remains to be completed (but only describe what remains), and place it inside a tag <task>.
- Describe the complete high-level plan for completing the remaining task (the list of remaining high-level steps), and place it inside a tag <plan>.
- Describe why the chosen high-level step should be executed now, which features of the current environment influence that decision, and how it should be done. Place it within a tag <subtask_reason>. First provide appropriate reasoning, and then the name of the action chosen (and not, "I choose action XYZ because ..."; rather, "I need to do XYZ, hence I should choose this action")
- Describe the high-level step that should be executed now (chosen from the list of high-level steps), and place it inside a tag <subtask>.



## Reminders

Keep in mind that the robot might not know where a certain object is, and it may have to move around the environment to search for it. Searching might be one of your subtasks. 

Also, it's possible that the actions chosen by the robot are sometimes suboptimal, you should take that into account. 
Your output reasoning chains should be from the perspective of the assistant, about how it should reason before picking the next action.
Note that the small pebble like thing near the gripper is part of the gripper itself, not a separate item.
The reasoning should be such that it first states why a certain action should be chosen, and then specifies that action ("I need to find XYZ, hence, I should do action A" and not "I should do action A. It will help me find XYZ").
Sometimes, the robot may select nonsensical actions; you don't necessarily have to justify such actions (for example by hallucinating objects that don't exist).
You can use information from all time steps to develop your plan, but your final output should be such that the output at time t doesn't assume information about the future (because the robot will think and act one step at a time).
Also, in the original trajectory, after every assistant action, there is a flag telling whether that action was successful or not. You can use this to inform your reasoning chain output. However, note that during deployment, the agent will not have access to this information directly. Instead, it must use its visual input to decide whether the action succeeded. Hence, in your output reasoning chains, when you say that a previous action was successful/failed, you must say that this is based on the image (you can use the ground truth success status, but don't say it in the output).
In the output reasoning trajectory, it is okay if the robot changes the original plan/subtasks as it obtains more information about the environment, or after failures, similar to how an intelligent agent would adapt as it works toward solving a task. The robot should also be capable of recovering from previous errors. Such changes should be clearly stated in the reasoning string.

## Task summary

Here is a breakdown of what needs to be done:

- Describe the task.
- Describe the high-level movements that were executed, based on the completed task and the listed features.
- Describe the plan for the solution that allowed the robot to complete the task successfully.
- For each step on the trajectory, describe the reasoning that leads to determining the correct action. The reasoning should be descriptive and precise. You should provide exactly one reasoning 
string for each ASSISTANT step on the trajectory specified by original_trajectory.
- Just return the dictionary directly, like:
```python
{{
  ...
}}
```
- At the very end of the response, write a single label FINISHED to indicate that the answer is complete.
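
A response in this format lends itself to a simple validity check before the annotation is accepted. The sketch below is our illustration (the paper does not specify its parsing code, and whether step keys are 0- or 1-indexed is an assumption): it verifies the FINISHED sentinel and that the dictionary has exactly one reasoning string per assistant step.

```python
import ast
import re

def parse_cot_response(response: str, step_ids: set) -> dict:
    """Extract and validate the per-step reasoning dict from the LLM reply."""
    if not response.rstrip().endswith("FINISHED"):
        raise ValueError("incomplete response: missing FINISHED label")
    match = re.search(r"```python\s*(\{.*\})\s*```", response, re.DOTALL)
    if match is None:
        raise ValueError("no Python dictionary found in the response")
    reasonings = ast.literal_eval(match.group(1))
    # the prompt requires one integer key per assistant message in the trajectory
    if set(reasonings.keys()) != step_ids:
        raise ValueError("expected exactly one reasoning string per assistant step")
    return reasonings
```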

### 10.2 Prompts for Synthetic Failed Trajectory Generation

Prompt to Generate Synthetic Mistakes and Verifications:

Given the above conversation, your job is to create a new trajectory where the agent makes a mistake. Don't change the instruction, just imagine some mistakes and insert them into the new trajectory.

Your mistakes should be realistic and plausible, i.e., something that the agent is actually likely to do, and not a random mistake. For example:
- The agent misunderstands the instruction: it interacts with the incorrect object, or with incorrect receptacles, or doesn't perform the expected operations on the objects/receptacles. Here, try to make the "incorrect" actions still somewhat close to the correct actions, e.g., replacing apple with orange instead of remote, or replacing turn on with another related action instead of put down the object.
- The agent doesn't complete a precondition for the next action, e.g., it does not 'find' the object before interacting with it, or doesn't 'open' the microwave or cabinet before putting the object inside. This one is quite important -- basically, think of ways in which the agent might be incompetent: even if it understood the task correctly, it might fail to perform it properly.
- The agent messes up the order in which actions should be done. For example, if it has to clean something, it puts the object in the sink, takes it out, and then turns on the faucet.
- The agent gets confused between different instances of the same object.
- The agent completes only one part of the instruction and not the full instruction.
- The agent doesn't do all the actions required for a subgoal; for example, it cleans an object in the sink, but does not pick it up before putting it somewhere else.

Keep in mind that the original trajectory is successful, so if you insert a mistake there, your action should be different from the action given in the original trajectory, because otherwise that action would be correct.

The output should also be in a similar json format. Additionally, for every action in the output, add a judgement on whether the action is correct or incorrect. First, provide detailed reasoning, and then the verdict ("action_is_correct: yes or no"). The verdict and judgement should be a single string and should go in a separate key in the dict called 'verification'. Also, it's possible that some actions are correct and others are incorrect in the same conversation.

Also, per conversation, add a mistake_description field, giving a summary of what the mistake was.
Make sure that in the output, before every assistant message there is a user message.
For example:

<new_trajectory>
mistake_summary: <insert mistake summary>

{
"content": "...",
"role": "user",
},
{
"content": "...",
"role": "assistant",
"is_action_successful": true,
"verification": "<insert reasoning here>\n
is_action_correct: X" where X is yes or no,
},

**IMPORTANT**:
- Judge every action on whether it advances the task or not. So it is possible that the reasoning for the action is incorrect, but it still advances the task; in that case, the action is correct.
- The mistakes should start from the task interpretation, planning, and subtask reasoning itself, and not like it understands the task but the subtask/action are not aligned with the reasoning.
- Keep in mind that the output conversation is from the perspective of the agent, and the agent doesn't know it's making a mistake. Make sure that in the output the agent should not seem to know that its plans and actions are not aligned with the instruction. The agent should think it's doing the right thing, so it doesn't even know that it misunderstood the instruction. So there should not be anything like "doing X even though the instruction said Y" or "mistakenly doing X".
- Try to make a 'realistic' mistake - go for an object that is somewhat related to the actual object mentioned in the instruction.
- In the verification, try to explain your reasoning in as much detail as possible, writing out which object was targeted and which should have been targeted instead.
- Try to generate diverse types of mistakes. You don't have to restrict yourself to the mistakes mentioned above; feel free to come up with other realistic failure modes.
- Keep in mind that the original trajectory is successful, so if you insert a mistake there, your action should be different from the action given in the original trajectory, because otherwise that action would be correct.
- Simply using a different object/receptacle/action might not be considered a mistake. For example, if a lamp is required, both floorlamp and desklamp are correct.
- Make sure that every assistant message has all the relevant tags (<task>, <plan>, <subtask>, etc., and especially the <action> tag).
- In the verification, first try to reason about what the instruction specifies and the action taken by the agent, how the two are related, what the agent should have done, and describe in detail why the action is incorrect.
- Every verification will be used independently of each other, so every verification should be complete in itself (shouldn't assume existence of previous verifications).

Double-check for these important points before writing out the output conversation. It is crucial that these instructions are followed. So check rigorously and strictly your output, and revise it to follow these instructions if necessary.

## 11 Examples of Synthetic Incorrect Actions

Here, we show some examples of trajectories with synthetic mistakes generated by the LLM to train the verifier\.

### 11.1 LangR

Example 1 (LangR): The agent erroneously picks up the bowl even though the instruction specifies a cleanser.

Instruction: Shift a cleanser and place it into the specified black table.

action 1: "<task>Move the bowl over to the black table</task>\n<plan>1. Go to right kitchen counter where the bowl usually sits.\n2. Pick up the bowl.\n3. Navigate to the black table.\n4. Place the bowl on the black table.</plan>\n<subtask>Navigate to the right kitchen counter.</subtask>\n<subtask_reason>The bowl is often left on the right side of the kitchen counter, so I will head there first.</subtask_reason>\n<action>('nav(receptacle_aabb_counter_right_kitchen_counter)', 10)</action>"

verification: "The instruction asks for relocating a cleanser, but the assistant plans to move a bowl instead. Nevertheless, navigating to a kitchen counter could logically be a first step for finding a cleanser as well, so this motion does not directly violate task progress.\naction_is_correct: yes"

=====

action 2: "<task>Move the bowl over to the black table</task>\n<plan>[ Pick bowl --> Go to black table --> Place bowl ]</plan>\n<subtask>Pick up the bowl.</subtask>\n<subtask_reason>I can see the bowl on the counter and will grab it.</subtask_reason>\n<action>('pick_bowl(robot_0)', 42)</action>"

verification: "The picked object is a bowl, not the requested cleanser, so this action does not advance the actual task.\naction_is_correct: no"

<following actions omitted for clarity>

Example 2 (LangR): The agent erroneously navigates to Table 1 instead of the TV stand after picking up the object.

Instruction: Place both a plum and a plate onto the TV stand.

action 1: "<task>Place a plum and a plate on the TV stand.</task> <plan>['navigate to sink counter', 'pick plum', 'navigate toward TV-stand area', 'place plum', 'search for plate', 'place plate']</plan> <subtask>Navigate to sink counter</subtask> <subtask_reason>The plum is normally kept on the kitchen counter near the sink, so the first step is to move there.</subtask_reason> <action>('nav(receptacle_aabb_sink_kitchen_counter)', 9)</action>"

verification: "The instruction requires obtaining the plum first. Moving from the start position to the sink counter is the expected way to reach the plum, so this movement advances the task.\naction_is_correct: yes"

=========

action 2: "<task>Place a plum and a plate on the TV stand.</task> <plan>['pick plum', 'navigate toward TV-stand area', 'place plum', 'search for plate', 'place plate']</plan> <subtask>Pick plum</subtask> <subtask_reason>The plum is visible on the counter and now within reach, so it should be picked up.</subtask_reason> <action>('pick_plum(robot_0)', 30)</action>"

verification: "Grasping the plum is necessary before transporting it anywhere. This action clearly moves the task forward.\naction_is_correct: yes"

==========

action 3: "<task>Deliver the plum.</task> <plan>['navigate to TV-stand area', 'place plum', 'search for plate', 'place plate']</plan> <subtask>Navigate to TV-stand area</subtask> <subtask_reason>The low wooden table in front of the sofa is identified as the surface directly in front of the television, so I should move there to drop the plum.</subtask_reason> <action>('nav(receptacle_aabb_Tbl1_Top1_frl_apartment_table_01)', 6)</action>"

verification: "The instruction specifies the TV stand, not Table 1. Navigating to Table 1 therefore heads to the wrong receptacle and does not directly advance the true goal.\naction_is_correct: no"

<following actions omitted for clarity>

### 11.2 ALFRED

Example 1 (ALFRED): The agent erroneously picks up the baseball bat instead of the tennis racket, and turns the lamp off instead of on.

Instruction: Grab a tennis racket between beds, turn on the lamp

action 1: "<task>I need to: (1) collect the elongated sporting object I can see between the two beds, and (2) operate the nearby lamp.</task>\n<plan>1) Focus on the bat-shaped item between the beds and secure it. 2) Locate the desk lamp and toggle its power.</plan>\n<subtask>Look for the sporting item I believe to be a BaseballBat.</subtask>\n<subtask_reason>The long, thin handle and rounded head suggest a bat; I will therefore issue a search for a BaseballBat.</subtask_reason>\n<action>find a BaseballBat</action>"

verification: "The instruction requires the agent to locate a TennisRacket, but the agent mistakenly assumes the object is a BaseballBat and executes a search for that instead. Although the 'find' command itself executes, it advances the wrong goal.\nis_action_correct: no"

============

action 2: "<task>The sporting item has been detected. I will now secure it.</task>\n<plan>Approach the object and grasp it.</plan>\n<subtask>Pick up the BaseballBat.</subtask>\n<subtask_reason>With the bat in view, gripping it by the handle completes the first half of the instruction.</subtask_reason>\n<action>pick up the BaseballBat</action>"

verification: "No BaseballBat is present; the elongated object is actually a TennisRacket. Because the target object does not exist, the simulator would return failure. Attempting this pick-up neither fulfils nor advances the stated task.\nis_action_correct: no"

============

action 3: "<task>With the sporting item handled, I will now take care of the lamp.</task>\n<plan>Find the DeskLamp, then toggle its power.</plan>\n<subtask>Locate the DeskLamp.</subtask>\n<subtask_reason>The lamp is required next; I must bring it into view before interacting.</subtask_reason>\n<action>find a DeskLamp</action>"

verification: "Searching for the lamp is sensible even though the racket hasn't been obtained yet. The action does move the agent toward one component of the goal, so the step is reasonable.\nis_action_correct: yes"

============

action 4: "<task>The lamp is now in reach. I will toggle it.</task>\n<plan>Interact with the lamp's switch.</plan>\n<subtask>Switch the DeskLamp off.</subtask>\n<subtask_reason>The lamp appears lit in my view; turning it off will leave it in the desired off state.</subtask_reason>\n<action>turn off the DeskLamp</action>"

verification: "The instruction explicitly says to turn the lamp ON, not OFF. The agent does the opposite, so even though the environment executes the switch action, this contradicts the task requirement.\nis_action_correct: no"

Example 2 (ALFRED): The agent erroneously places the wine bottle instead of the vase on the table.

```
Instruction: Put a glass vase on a table

action 1: "<task>Instruction outstanding: place the vase on a table.</task>
<plan>1) Locate a suitable decorative vessel; 2) Pick it up; 3) Bring it to a table; 4) Set it down.</plan>
<subtask>Begin by locating a nice glass vessel (going to look for a WineBottle that will serve as the vase).</subtask>
<subtask_reason>The shelves often hold tall glass containers; a wine bottle can double as a vase.</subtask_reason>
<action>find a WineBottle</action>"

verification: "The instruction explicitly says to put a glass vase (object class Glassbottle/Vase) on a table. The assistant chooses to search for a WineBottle instead. Although both are made of glass, a wine bottle is not the requested vase, so this action does not directly progress toward satisfying the instruction.
action_is_correct: no"

=============

action 2: "<task>Vessel found; next step is to secure it.</task>
<plan>Pick the bottle up, then head to a table.</plan>
<subtask>Grasp the WineBottle.</subtask>
<subtask_reason>The object is within reach and I need to carry it to the table.</subtask_reason>
<action>pick up the WineBottle</action>"

verification: "Again, the agent manipulates a WineBottle instead of the required glass vase. Picking it up succeeds mechanically, but the chosen object is incorrect for the task.
action_is_correct: no"

=============

action 3: "<task>Object secured; now need a table.</task>
<plan>Navigate to a dining table in the room.</plan>
<subtask>Find a DiningTable.</subtask>
<subtask_reason>Any table surface will work as a display spot.</subtask_reason>
<action>find a DiningTable</action>"

verification: "Navigating toward a table is generally required, but because the agent is holding the wrong object, this search does not really bring it closer to completing the instructed task. Therefore the action, while plausible, does not advance the true goal.
action_is_correct: no"

=============

action 4: "<task>Table located; time to place the decorative bottle.</task>
<plan>Release the bottle onto the table surface.</plan>
<subtask>Put the carried bottle down on the table.</subtask>
<subtask_reason>The receptacle is in reach and I no longer need to transport the item.</subtask_reason>
<action>put down the object in hand</action>"

verification: "The assistant places the WineBottle on the table instead of a glass vase. The receptacle is correct (a table), but the object does not match the instruction. Consequently, the high-level task remains unsatisfied.
action_is_correct: no"
```
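Both transcripts end each verification with a plain-text verdict line, although the key spelling drifts between `is_action_correct` (Example 1) and `action_is_correct` (Example 2). A minimal sketch of how such verdicts could be extracted robustly is shown below; the `parse_verdict` helper is hypothetical and not part of the released code.

```python
import re
from typing import Optional

# Hypothetical helper (illustrative, not from the released code): pull the
# final yes/no verdict out of a verifier's free-form chain-of-thought.
# The transcripts above end in either "is_action_correct: ..." or
# "action_is_correct: ...", so both spellings are accepted.
_VERDICT_RE = re.compile(
    r"(?:is_action_correct|action_is_correct)\s*:\s*(yes|no)",
    re.IGNORECASE,
)

def parse_verdict(verification_cot: str) -> Optional[bool]:
    """True for 'yes', False for 'no', None if no verdict line is found."""
    matches = _VERDICT_RE.findall(verification_cot)
    if not matches:
        return None  # malformed verifier output; caller picks a fallback
    return matches[-1].lower() == "yes"  # trust the last stated verdict

# Applied to a snippet of the transcript above:
assert parse_verdict("... it advances the wrong goal.\nis_action_correct: no") is False
```

Taking the last match guards against the verifier mentioning the key earlier in its reasoning before committing to a final verdict.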

## 12 Additional Qualitative Examples of Verification During Inference

### 12.1 LangR

![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_llarp_1_01.png)![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_llarp_1_02.png)

Instruction: Procure a purple fruit and deliver it to the left counter.

Sampled Action: <task>Find and pick up the purple fruit and then deliver it to the left counter.</task> <plan>[Search TV stand, search sofa, search table2, pick fruit, navigate to left counter, place fruit]</plan> <subtask_reason>At the TV stand, I should try a pick to see if the fruit is here, hence pick(object).</subtask_reason> <subtask>Attempt to pick the fruit at the TV stand.</subtask> <action>('pick_strawberry(robot_0)', 38)</action>

Verification CoT: The assistant tries to pick a strawberry instead of the requested purple fruit. This does not help fulfil the instruction and therefore does not advance the task. action_is_correct: no

Figure 7: Verification example on the LangR benchmark. Here, the task is to move a “purple fruit”, but the agent tries to pick up a strawberry. The verifier correctly identifies that the strawberry is not the right object, classifying the action as incorrect.

![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qualitative_example_llarp_2_01.png)![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qualitative_example_llarp_2_02.png)

Instruction: Move the rubriks cube to the brown table, the lego to the brown table, and the bowl to the brown table.

Sampled Action: <task>Remaining: locate and pick the bowl (it was not found at the TV stand).</task> <plan>[Return to table1 to search for bowl, then cabinet area, pick bowl, navigate to brown table, place bowl]</plan> <subtask_reason>I am at the brown table and need to check if the bowl is here, hence I should attempt a place action to see if it is present</subtask_reason> <subtask>attempt to place bowl on brown table</subtask> <action>('place(receptacle_aabb_Tbl2_Top1_frl_apartment_table_02)', 51)</action>

Verification CoT: The robot is not holding any object (the previous pick attempt failed and no new object was picked). Attempting to place an item while not holding anything cannot succeed and does not progress the task. action_is_correct: no

Figure 8: Verification example on the LangR benchmark. Here, the agent erroneously attempts to place the bowl without successfully picking it up first. The verifier identifies this mistake, classifying the action as incorrect.
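Each LangR example above pairs the instruction and a sampled action with a verification CoT. The sketch below shows one plausible way such a verifier query might be assembled; the prompt wording is an illustrative assumption, not the paper's actual template.

```python
# Illustrative prompt assembly for one verification query, mirroring the
# structure of the examples above (instruction + sampled action -> a CoT
# that ends in a verdict). The wording is an assumption, not the paper's
# actual verifier prompt.
def build_verifier_prompt(instruction: str, sampled_action: str) -> str:
    return (
        "You are verifying an embodied agent's proposed next action.\n"
        f"Instruction: {instruction}\n"
        f"Sampled action: {sampled_action}\n"
        "Reason step by step about whether this action advances the task, "
        "then finish with 'action_is_correct: yes' or 'action_is_correct: no'."
    )

# Example, using Figure 7's instruction and sampled action:
print(build_verifier_prompt(
    "Procure a purple fruit and deliver it to the left counter.",
    "('pick_strawberry(robot_0)', 38)",
))
```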
### 12.2 ALFRED

![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_alfred_1_01.png)![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_alfred_1_02.png)

Instruction: Place washed leafy green vegetable in a receptacle that can keep it fresh for several days

Sampled Action: <task>The fridge is open and I (mistakenly) see another lettuce inside; I still intend to store the held lettuce.</task> <plan>[Optional] Decide whether to store this extra lettuce or switch back to my original intent, then …</plan> <subtask_reason>I am uncertain of needing both wastebulbs; closing keeps the environment tidy while I reconsider.</subtask_reason> <subtask>Close the fridge door.</subtask> <action>close the Fridge</action>

Verification CoT: Closing the fridge at this moment is counter-productive because the lettuce needs to remain inside to stay fresh. Although syntactically valid, it moves the state away from the goal. action_is_correct: no

Figure 9: Verification example on the ALFRED benchmark. Here, the agent incorrectly attempts to close the fridge before putting the lettuce inside because it mistakenly detects a second lettuce in the fridge. The verifier identifies this mistake, classifying the action as incorrect.

![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_alfred_2_01.png)![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_alfred_2_02.png)

Instruction: As you prepare for dinner, arrange an eating surface that will hold your food with a spoon in it on the kitchen table.

Sampled Action: <task>Entire task still outstanding: create a state where the dining-table has a container holding a spoon resting on it.</task> <plan>[1] locate a spoon-like utensil, [2] pick it up, [3] find a bowl or cup, [4] place utensil inside it, [5] carry the filled utensil to the dining-table, [6] set it down on the table.</plan> <subtask_reason>Before anything else I need a spoon; without knowing its position I cannot grasp or use it. Therefore I must visually search for an appropriate utensil first.</subtask_reason> <subtask>Locate a spoon-like utensil.</subtask> <action>find a Ladle</action>

Verification CoT: The user's instruction explicitly calls for a "spoon." The agent instead decides to search for a Ladle, which is a different utensil. Although both are eating utensils, a ladle is not the requested object. Thus the chosen action does not advance the stated goal directly. action_is_correct: no

Figure 10: Verification example on the ALFRED benchmark. Here, the task is to move a spoon, but the agent proposes to move a ladle, which is not in the scene. The verifier correctly identifies that the correct object is a spoon, classifying the action as incorrect.

![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_alfred_3_01.png)![Refer to caption](https://arxiv.org/html/2605.12620v1/images/suppl_qual_example_alfred_3_02.png)

Instruction: During a cooking session, keep a clean flipping tool on a wooden table.

Sampled Action: <task>Entire task still outstanding: locate a suitable flipping utensil, make sure it is clean, and finally leave it resting on the wooden dining table.</task> <plan>1) Find a knife. 2) Pick it up. 3) Go to a sink. 4) Wash the knife by turning water on then off. 5) Bring the clean knife to the wooden dining table and set it down there.</plan> <subtask_reason>The very first requirement is to know where a knife is; without visual contact I cannot interact with it or progress toward washing or placing. Therefore I need to perform a search action that will move my sensors through the scene until a knife is detected.</subtask_reason> <subtask>Find a knife</subtask> <action>find a Knife</action>

Verification CoT: The instruction asks for any kind of cleaning flipping tool (i.e., a knife). By performing the action "find a Knife" the agent begins the search for such an object, which is a reasonable first step toward accomplishing the goal. action_is_correct: yes

Figure 11: Verification example on the ALFRED benchmark. Here, the task is to move a spatula, but the agent proposes to move a knife. The verifier fails to identify this mistake and incorrectly classifies the action as correct.
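Figures 7 through 11 illustrate the per-action verdicts that drive action selection at inference time. The sketch below shows one plausible way to wire those verdicts into a best-of-N selection loop, reusing the `parse_verdict` helper sketched after Example 2 above; `policy_sample` and `verify` are hypothetical stand-ins, and the candidate count, tie-breaking, and fallback are illustrative choices rather than the paper's exact procedure.

```python
from typing import Callable, List

# Sketch of a best-of-N selection loop consistent with the per-action
# verdicts shown in Figures 7-11. `policy_sample` and `verify` are
# hypothetical stand-ins for the agent's policy and the trained verifier.
def select_action(
    observation: str,
    instruction: str,
    policy_sample: Callable[[str, str], str],  # one candidate action string
    verify: Callable[[str, str, str], str],    # verification CoT for a candidate
    num_candidates: int = 5,
) -> str:
    candidates: List[str] = [
        policy_sample(observation, instruction) for _ in range(num_candidates)
    ]
    for action in candidates:
        cot = verify(observation, instruction, action)
        if parse_verdict(cot):  # parser sketched after Example 2 above
            return action       # first candidate the verifier accepts
    return candidates[0]        # fallback when no candidate is verified
```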
