GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents


Summary

GROW proposes a novel reinforcement learning framework that adapts GRPO to multi-turn VLM agent tasks by decomposing trajectories into state-action pairs and computing advantages between them, achieving state-of-the-art performance on over 800 Minecraft tasks.

arXiv:2605.20246v1 Announce Type: new Abstract: Recently, vision-language model (VLM) agents have shown promising progress in open-world tasks, where successful task completion often requires multiple turns of visual perception and action execution. However, existing methods still rely primarily on Supervised Fine-Tuning (SFT) with expert demonstrations, while advanced reinforcement learning (RL) algorithms, specifically Group Relative Policy Optimization (GRPO), have not been effectively employed for multi-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long and noisy contexts. To address this issue, we propose GROW, an RL framework for open-world VLM agents that decomposes collected trajectories into state-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions. Experiments on more than 800 Minecraft tasks show that our method achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of our proposed RL framework for open-world VLM agents.

# GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents
Source: [https://arxiv.org/html/2605.20246](https://arxiv.org/html/2605.20246)
Xiongbin Wu (1,2), Zhihao Luo (2,3), Shanzhe Lei (2), Lechao Zhang (2,3), Xuhong Wang (2), JIE YANG (1), Zhonglong Zheng (4), Yuanjie Zheng (5), Xin Tan (2,3), Wei Liu (1)

(1) Shanghai Jiao Tong University; (2) Shanghai Artificial Intelligence Laboratory; (3) East China Normal University; (4) Zhejiang Normal University; (5) Shandong Normal University

###### Abstract

Recently, vision\-language model \(VLM\) agents have shown promising progress in open\-world tasks, where successful task completion often requires multiple turns of visual perception and action execution\. However, existing methods still rely primarily on Supervised Fine\-Tuning \(SFT\) with expert demonstrations, while advanced reinforcement learning \(RL\) algorithms, specifically Group Relative Policy Optimization \(GRPO\), have not been effectively employed for multi\-turn RL in these tasks, because standard GRPO requires full trajectories as training samples, which leads to excessively long and noisy contexts\. To address this issue, we propose GROW, an RL framework for open\-world VLM agents that decomposes collected trajectories into state\-action samples and computes advantages between these samples rather than treating a full trajectory as a single entity\. We further provide a surrogate analysis indicating that, even though the grouped samples are conditioned on different local states rather than an identical prompt context, the objective can preserve the core relative policy optimization signal of GRPO under simplifying assumptions\. Experiments on more than 800 Minecraft tasks show that our method achieves state\-of\-the\-art \(SOTA\) performance, demonstrating the effectiveness of our proposed RL framework for open\-world VLM agents\.

## 1 Introduction

VLM agents have become increasingly capable in open\-world environments \[[20](https://arxiv.org/html/2605.20246#bib.bib27),[15](https://arxiv.org/html/2605.20246#bib.bib28),[9](https://arxiv.org/html/2605.20246#bib.bib7),[29](https://arxiv.org/html/2605.20246#bib.bib53)\]\. In these domains, an agent must repeatedly interpret visual states, choose actions, and adapt its behavior through multi\-turn interaction with the environment\. Most recent efforts improve such agents through SFT on expert demonstrations \[[16](https://arxiv.org/html/2605.20246#bib.bib54),[23](https://arxiv.org/html/2605.20246#bib.bib43)\], enabling them to imitate task\-relevant perception\-action behaviors from curated demonstrations\. However, SFT relies on large amounts of high\-quality expert data, whose collection is often expensive and difficult to scale\. Moreover, prior studies \[[19](https://arxiv.org/html/2605.20246#bib.bib65),[25](https://arxiv.org/html/2605.20246#bib.bib64),[27](https://arxiv.org/html/2605.20246#bib.bib63)\] have shown that SFT alone can lag behind RL\-trained VLM agents in performance\. These limitations motivate the need for advanced RL methods that can effectively train open\-world VLM agents through interaction\.

Recent research on VLMs \[[13](https://arxiv.org/html/2605.20246#bib.bib47),[8](https://arxiv.org/html/2605.20246#bib.bib59),[7](https://arxiv.org/html/2605.20246#bib.bib62),[27](https://arxiv.org/html/2605.20246#bib.bib63),[25](https://arxiv.org/html/2605.20246#bib.bib64),[19](https://arxiv.org/html/2605.20246#bib.bib65)\] that uses GRPO \[[17](https://arxiv.org/html/2605.20246#bib.bib21)\] as the RL algorithm has shown the effectiveness of GRPO in improving foundation\-model policies through group\-wise relative optimization\. By comparing multiple sampled outputs within the same group, GRPO constructs relative advantages without training an additional value model, making it especially suitable for large\-scale VLM optimization where value estimation can be costly and unstable\. These strengths make GRPO a natural algorithmic basis for refining VLM agents through environment interaction\. However, directly transferring GRPO to open\-world tasks is nontrivial\. Standard GRPO computes advantages across trajectories conditioned on the same prompt and optimizes over these full trajectories\. As shown in Figure [1](https://arxiv.org/html/2605.20246#S1.F1), using full trajectories to predict actions in open\-world tasks introduces excessively long contexts that also contain substantial noise\.

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/context_len_vs_steps.png)

Figure 1: Context length in the trajectory increases with the number of interaction steps between the VLM agent and the environment, often exceeding the maximum token length as interactions accumulate\.

To address this issue, we propose GRPO for open\-world VLM agents \(GROW\), an RL framework that adapts GRPO to open\-world tasks, where trajectories are often too long and agents often make decisions based on short\-horizon states\. GROW first decomposes the collected rollout trajectories into state\-action samples and then computes relative advantages among the samples within the same rollout group\. However, this reformulation also introduces a theoretical issue that does not arise in standard GRPO: after trajectory decomposition, grouped samples are no longer different responses to the same prompt\. To address this, we provide a surrogate analysis showing that, under simplifying assumptions, the proposed objective can still provide an effective relative policy optimization signal\.

Our main contributions are summarized as follows:

- We propose GROW, an RL framework for open\-world VLM agents\. The framework combines cold\-start training on state\-action samples with GRPO\-based policy refinement through trajectory decomposition\.
- We provide a surrogate analysis suggesting that, under reasonable tractability approximations, relative policy optimization remains effective in GROW even when grouped samples are conditioned on different local states rather than an identical prompt context\.
- We instantiate and test GROW mainly in Minecraft \[[12](https://arxiv.org/html/2605.20246#bib.bib37),[9](https://arxiv.org/html/2605.20246#bib.bib7),[2](https://arxiv.org/html/2605.20246#bib.bib38)\]\. Across more than 800 Minecraft tasks, ranging from embodied spatial navigation and precise GUI manipulation to highly dynamic combat scenarios, GROW establishes a new SOTA in both success rate and execution efficiency\. Notably, our framework exhibits strong generalization to unseen tasks and fosters sophisticated behavioral skills, such as active target reacquisition and distractor\-robust GUI operation, suggesting that it learns reusable interaction strategies rather than merely memorizing trajectories\.

## 2 Related Work

### 2\.1 Agents in the Open World

Open\-world tasks provide an important testbed for developing general\-purpose agents because they require agents to perceive visual observations, reason about dynamic environments, and execute actions over extended interaction sequences\. Minecraft is a representative example due to its high degree of freedom and broad task diversity, spanning embodied control, resource manipulation, and precise graphical user interface interactions\. Following the seminal work of VPT \[[3](https://arxiv.org/html/2605.20246#bib.bib6)\], which leverages large\-scale expert demonstrations and explores RL for fine\-tuning, subsequent studies have advanced agent learning in several directions\. Building upon VPT \[[3](https://arxiv.org/html/2605.20246#bib.bib6)\], STEVE\-1 \[[11](https://arxiv.org/html/2605.20246#bib.bib24)\] is trained for text\-to\-behavior generation in Minecraft, enabling users to control agents with text instructions to complete short\-horizon, open\-ended tasks from raw pixels and low\-level controls\. ROCKET\-1 \[[5](https://arxiv.org/html/2605.20246#bib.bib56)\] introduces visual\-temporal context prompting to connect high\-level VLM reasoning with low\-level policy execution for spatially grounded interaction\. ROCKET\-3 \[[4](https://arxiv.org/html/2605.20246#bib.bib26)\] further improves exploration in unseen environments through RL with cross\-view reasoning\. More recently, research has increasingly focused on VLM agents\. For example, JARVIS\-VLA \[[9](https://arxiv.org/html/2605.20246#bib.bib7)\] adopts staged training to improve task completion in Minecraft, while similar imitation\-based VLM agents have also shown effectiveness in other game environments such as Genshin Impact \[[20](https://arxiv.org/html/2605.20246#bib.bib27)\] and Steam games \[[15](https://arxiv.org/html/2605.20246#bib.bib28)\]\. Despite this progress, efficient RL methods for VLM agents remain underexplored, as most existing approaches still depend heavily on imitation learning\. Our work addresses this gap by providing a scalable RL framework for training open\-world VLM agents\.

### 2\.2 GRPO for Multi\-Turn VLM Agents

GRPO has been widely explored for RL in tasks requiring many rounds of interaction\. AgentGym\-RL \[[24](https://arxiv.org/html/2605.20246#bib.bib2)\] studies multi\-turn RL for large language model agents and improves long\-horizon decision making with a curriculum over interaction length\. InquireMobile \[[1](https://arxiv.org/html/2605.20246#bib.bib49)\] and ColorAgent \[[10](https://arxiv.org/html/2605.20246#bib.bib50)\] extend this paradigm to settings where the agent must request authorization before taking actions or incorporate human instructions during execution\. AGENTRL \[[26](https://arxiv.org/html/2605.20246#bib.bib33)\] further investigates system\-level scheduling and resource allocation for GRPO\-based multi\-turn training in order to improve training efficiency\. However, although these works substantially broaden the study of GRPO in multi\-turn settings, they mainly retain a trajectory\-level or dialogue\-level formulation, where each optimization sample may contain an increasingly long interaction history\. When directly applied to open\-world tasks, such full\-trajectory samples can introduce substantial irrelevant noise and lead to excessive context accumulation as trajectories grow longer\. Our work addresses this issue by decomposing trajectories into state\-action samples, and further provides a surrogate analysis indicating that, although this departs from the identical\-prompt grouping assumption of standard GRPO, the objective still preserves an effective relative policy optimization signal under simplifying assumptions\.

## 3 Method

### 3\.1 Notation

We formulate the process of executing open\-world tasks as a Markov decision process \(MDP\), denoted by $\mathcal{M}=\langle\mathcal{C},\mathcal{S},\mathcal{A},\mathcal{R},\gamma\rangle$\. Here, $\mathcal{C}$ denotes the task space, which contains a set of heterogeneous tasks $\mathcal{C}=\{c_{1},c_{2},c_{3},\ldots\}$, where each task corresponds to a concrete objective such as "kill zombie" or "mine gold ore"\. The state space $\mathcal{S}$ denotes the set of states, where each state corresponds to the current observation or a short history of recent observations together with the task instruction\. The action space $\mathcal{A}$ consists of primitive keyboard and mouse operations, such as KEYDOWN and MOUSE\_MOVE, ensuring applicability to both embodied interaction and graphical user interface \(GUI\) manipulation\. We provide more details about the action space in Appendix [4](https://arxiv.org/html/2605.20246#A1.T4)\. A trajectory $\tau$ is defined as a sequence of state\-action pairs, i\.e\. $\tau=\{(s_{1},a_{1}),(s_{2},a_{2}),\dots,(s_{H},a_{H})\}$, where $H$ is the length of $\tau$\. We consider a sparse and verifiable reward setting, where $\mathcal{R}(\tau)=1$ only when successful task completion can be verified, and $\mathcal{R}(\tau)=0$ otherwise\. Finally, $\gamma\in(0,1)$ denotes the discount factor\.

### 3\.2 GROW Framework

Figure [2](https://arxiv.org/html/2605.20246#S3.F2) provides an overview of GROW, our proposed RL framework\. During the rollout phase, state\-action samples are collected by decomposing the rollout trajectories, and relative advantages are then computed among the state\-action samples belonging to the same rollout group\. This design preserves the state\-action modeling paradigm commonly used in open\-world tasks, while avoiding the need to optimize an entire rollout trajectory as a single full\-context sample\.

#### 3\.2\.1 Decomposition of Rollout Trajectories

During the rollout phase, $G$ parallel environments are instantiated for each task instruction\. In each environment, the VLM agent receives states from the environment and selects the corresponding actions to perform the task\. A group of trajectories $\{\tau_{i}\}_{i=1}^{G}$ is then collected from the $G$ environments; in standard GRPO, each trajectory is treated as a single training sample\. For open\-world tasks that require many interaction turns, such full\-trajectory samples can introduce substantial irrelevant information and lead to excessive context accumulation, ultimately degrading the quality of policy gradient estimation\.

To address this issue, we decompose each collected trajectory into a set of fine\-grained state\-action samples\. Specifically, each trajectory $\tau=\{(s_{1},a_{1}),(s_{2},a_{2}),\dots,(s_{H},a_{H})\}$ is decomposed in a step\-by\-step manner, where each individual transition serves as an independent optimization unit\. To assign learning signals to the decomposed samples, we propagate the sparse episodic reward backward along each trajectory with a discount factor:

$$r_{i,t}=\gamma^{H_{i}-t}\,\mathcal{R}(\tau_{i}), \tag{1}$$

where $\gamma\in(0,1)$ is the discount factor\. This temporal discounting ensures that state\-action samples closer to task completion receive stronger learning signals, reflecting their higher causal relevance to the final outcome\. The rollout group is therefore transformed into a group of state\-action samples, i\.e\. $G_{\mathrm{s}}=\{(s_{i,t},a_{i,t},r_{i,t})\mid i\in[1,G],\,t\in[1,H_{i}]\}$\. In this way, we reformulate standard GRPO by changing the optimization units from full\-trajectory samples to fine\-grained state\-action samples\.
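The decomposition and reward propagation of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `StepSample`, `decompose_trajectory`, and `decompose_group` are hypothetical, and states and actions are treated as opaque objects.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence, Tuple


@dataclass
class StepSample:
    """One decomposed state-action sample with its discounted reward r_{i,t}."""
    state: Any
    action: Any
    reward: float


def decompose_trajectory(
    steps: Sequence[Tuple[Any, Any]],
    episodic_reward: float,
    gamma: float,
) -> List[StepSample]:
    """Split one trajectory into per-step samples, following Eq. 1.

    Step t (1-indexed) of a trajectory of length H receives
    gamma ** (H - t) * R(tau), so the last step gets the full reward.
    """
    horizon = len(steps)
    samples = []
    for t, (state, action) in enumerate(steps, start=1):
        reward = gamma ** (horizon - t) * episodic_reward
        samples.append(StepSample(state, action, reward))
    return samples


def decompose_group(
    trajectories: Sequence[Sequence[Tuple[Any, Any]]],
    episodic_rewards: Sequence[float],
    gamma: float,
) -> List[StepSample]:
    """Flatten a rollout group of G trajectories into the sample set G_s."""
    group: List[StepSample] = []
    for steps, reward in zip(trajectories, episodic_rewards):
        group.extend(decompose_trajectory(steps, reward, gamma))
    return group
```

Because the reward is binary, every sample from a failed trajectory receives a reward of zero, and only successful trajectories produce the discounted profile.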

#### 3\.2\.2 Policy Optimization with State\-Action Samples

Unlike standard GRPO, where relative advantages are computed among the trajectories in the same group, GROW computes relative advantages over the decomposed state\-action samples within each rollout group\. As illustrated by the advantage computation module in Figure [2](https://arxiv.org/html/2605.20246#S3.F2), the discounted rewards are normalized across state\-action samples derived from the same rollout group to obtain the advantage:

$$A_{i,t}=\frac{r_{i,t}-\mu}{\sigma}, \tag{2}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the reward set $\{r_{i,t}\mid i\in[1,G],\,t\in[1,H_{i}]\}$\. This leads to our training objective, which is defined as:

$$\mathcal{J}=\underset{\begin{subarray}{c}c\sim\mathcal{C}\\ a\sim\pi_{\mathrm{old}}(\cdot\mid s)\end{subarray}}{\mathbb{E}}\Biggl\{\frac{1}{G}\sum_{i=1}^{G}\frac{1}{H_{i}}\sum_{t=1}^{H_{i}}\min\left(\rho_{i,t}(\theta)A_{i,t},\,\mathrm{clip}(\rho_{i,t}(\theta),1-\epsilon,1+\epsilon)A_{i,t}\right)\Biggr\} \tag{3}$$

where $\rho_{i,t}(\theta)=\frac{\pi_{\theta}(a_{i,t}\mid s_{i,t})}{\pi_{\mathrm{old}}(a_{i,t}\mid s_{i,t})}$ is the probability ratio\.
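A single-group estimate of Equations 2 and 3 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's training code: it uses the population standard deviation, adds a small `eps` to the denominator for numerical safety, and defaults to a clip range of 0.2, none of which is specified in the text above. The function names and the nested-list layout (one inner list per trajectory) are hypothetical.

```python
import math
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Eq. 2: normalize rewards over all state-action samples in one group.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def grow_objective(
    logp_new: Sequence[Sequence[float]],
    logp_old: Sequence[Sequence[float]],
    rewards: Sequence[Sequence[float]],
    clip_eps: float = 0.2,
    eps: float = 1e-8,
) -> float:
    """Eq. 3 for one rollout group (the outer expectation is not estimated).

    Each argument holds one inner list per trajectory, with one entry per
    step: log pi_theta(a|s), log pi_old(a|s), and the discounted reward.
    """
    # Advantages are normalized over the flattened group, trajectory by trajectory.
    flat_rewards = [r for traj in rewards for r in traj]
    advantages = iter(group_advantages(flat_rewards, eps))

    total = 0.0
    for new_i, old_i in zip(logp_new, logp_old):
        traj_sum = 0.0
        for lp_new, lp_old in zip(new_i, old_i):
            adv = next(advantages)
            ratio = math.exp(lp_new - lp_old)  # rho_{i,t}(theta)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            traj_sum += min(ratio * adv, clipped * adv)
        total += traj_sum / len(new_i)  # 1 / H_i
    return total / len(logp_new)  # 1 / G
```

In practice the objective would be computed on tensors and maximized by gradient ascent; plain floats are used here only to keep the arithmetic of the two equations visible.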

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/main_picture.png)

Figure 2: Overview of GROW, an RL framework for open\-world VLM agents\. Standard GRPO collects full\-trajectory samples from the $G$ rollouts, which often leads to excessively long and noisy contexts in open\-world tasks\. In contrast, GROW addresses this issue by decomposing these trajectories into state\-action samples, ensuring appropriate context lengths\.

### 3\.3 Surrogate Analysis

In this section, we provide a surrogate analysis showing that computing relative advantages over decomposed state\-action samples can induce a meaningful relative optimization signal, despite departing from the identical\-prompt setting of standard GRPO\.

Since the required number of interactions to complete a specific task is generally consistent across parallel environments, and we impose a maximum step limit on unsuccessful trajectories, we can approximate the trajectory length for a given task $c$ as a uniform constant $H$, namely $H_{i}\approx H$\. We provide the statistics of task step counts in Appendix [B](https://arxiv.org/html/2605.20246#A2)\. Under this mild assumption, the mean reward over all decomposed state\-action samples in the rollout group can be written as

$$
\begin{aligned}
\mu &= \frac{1}{\sum_{i=1}^{G}H_{i}}\sum_{i=1}^{G}\sum_{t=1}^{H_{i}}r_{i,t}
= \frac{1}{\sum_{i=1}^{G}H_{i}}\sum_{i=1}^{G}\sum_{t=1}^{H_{i}}\gamma^{H_{i}-t}\mathcal{R}(\tau_{i}) \\
&= \frac{1}{\sum_{i=1}^{G}H_{i}}\sum_{i=1}^{G}\mathcal{R}(\tau_{i})\sum_{t=1}^{H_{i}}\gamma^{H_{i}-t}
\approx \frac{1}{GH}\sum_{i=1}^{G}\mathcal{R}(\tau_{i})\sum_{t=1}^{H}\gamma^{H-t}
\end{aligned}
$$

We further define

$$C_{\gamma}=\frac{1}{H}\sum_{t=1}^{H}\gamma^{H-t}=\frac{1-\gamma^{H}}{H(1-\gamma)},\qquad\gamma\in(0,1) \tag{4}$$

which is the average temporal discount coefficient within a trajectory\. Because $\gamma\in(0,1)$, it follows that $0<C_{\gamma}<1$ whenever $H\geq 2$ \(for $H=1$, $C_{\gamma}=1$\)\. Let $S$ denote the average trajectory\-level return in the current rollout group, i\.e\. $S=\frac{1}{G}\sum_{i=1}^{G}\mathcal{R}(\tau_{i})$\. Then the mean reward admits the compact form $\mu\approx C_{\gamma}S$\.
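The closed form in Equation 4 and the compact form $\mu\approx C_{\gamma}S$ can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, and the horizons and outcomes passed to it are made-up inputs rather than values from the paper.

```python
from typing import Sequence


def c_gamma_sum(gamma: float, horizon: int) -> float:
    """Left-hand side of Eq. 4: average of gamma ** (H - t) over t = 1..H."""
    return sum(gamma ** (horizon - t) for t in range(1, horizon + 1)) / horizon


def c_gamma_closed(gamma: float, horizon: int) -> float:
    """Right-hand side of Eq. 4: geometric-series closed form."""
    return (1.0 - gamma ** horizon) / (horizon * (1.0 - gamma))


def group_mean_reward(
    horizons: Sequence[int],
    outcomes: Sequence[float],
    gamma: float,
) -> float:
    """Exact mean of r_{i,t} over every decomposed sample in a rollout group."""
    total = 0.0
    for horizon, outcome in zip(horizons, outcomes):
        for t in range(1, horizon + 1):
            total += gamma ** (horizon - t) * outcome
    return total / sum(horizons)
```

When all trajectories in the group share the same length, `group_mean_reward` equals `c_gamma_closed(gamma, H) * S` up to floating-point error; with unequal lengths the two differ, which is the gap absorbed by the approximation $H_{i}\approx H$.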

For clarity, we analyze the centered reward $r_{i,t}-\mu$ and omit the standard\-deviation term $\sigma$, which acts as a shared positive scaling factor within each rollout group and thus changes neither the sign nor the relative ordering of the advantages\. Under this simplification, the centered reward for each decomposed state\-action sample becomes

$$\hat{A}_{i,t}=r_{i,t}-\mu=\gamma^{H_{i}-t}\mathcal{R}(\tau_{i})-C_{\gamma}S \tag{5}$$
Substituting Equation [5](https://arxiv.org/html/2605.20246#S3.E5) into the objective in Equation [3](https://arxiv.org/html/2605.20246#S3.E3) \(clipping is omitted for brevity\) gives

$$
\begin{aligned}
\mathcal{J} &= \underset{\begin{subarray}{c}c\sim\mathcal{C}\\ a\sim\pi_{\mathrm{old}}(\cdot\mid s)\end{subarray}}{\mathbb{E}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{H_{i}}\sum_{t=1}^{H_{i}}\rho_{i,t}\left(\gamma^{H_{i}-t}\mathcal{R}(\tau_{i})-C_{\gamma}S\right)\right] \\
&= \underset{\begin{subarray}{c}c\sim\mathcal{C}\\ a\sim\pi_{\mathrm{old}}(\cdot\mid s)\end{subarray}}{\mathbb{E}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{H_{i}}\sum_{t=1}^{H_{i}}C_{\gamma}\rho_{i,t}\bigl(\mathcal{R}(\tau_{i})-S\bigr)\right] \\
&\quad+\underset{\begin{subarray}{c}c\sim\mathcal{C}\\ a\sim\pi_{\mathrm{old}}(\cdot\mid s)\end{subarray}}{\mathbb{E}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{H_{i}}\sum_{t=1}^{H_{i}}\rho_{i,t}\left(\gamma^{H_{i}-t}-C_{\gamma}\right)\mathcal{R}(\tau_{i})\right]
\end{aligned} \tag{6}
$$
For convenience, we define

$$\mathcal{J}_{\mathrm{traj}}=\underset{\begin{subarray}{c}c\sim\mathcal{C}\\ a\sim\pi_{\mathrm{old}}(\cdot\mid s)\end{subarray}}{\mathbb{E}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{H_{i}}\sum_{t=1}^{H_{i}}\rho_{i,t}\bigl(\mathcal{R}(\tau_{i})-S\bigr)\right] \tag{7}$$

$$\mathcal{J}_{\mathrm{step}}=\underset{\begin{subarray}{c}c\sim\mathcal{C}\\ a\sim\pi_{\mathrm{old}}(\cdot\mid s)\end{subarray}}{\mathbb{E}}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{H_{i}}\sum_{t=1}^{H_{i}}\rho_{i,t}\left(\gamma^{H_{i}-t}-C_{\gamma}\right)\mathcal{R}(\tau_{i})\right] \tag{8}$$

Then Equation [6](https://arxiv.org/html/2605.20246#S3.Ex4) can be rewritten as $\mathcal{J}=C_{\gamma}\mathcal{J}_{\mathrm{traj}}+\mathcal{J}_{\mathrm{step}}$\. On the one hand, $\mathcal{J}_{\mathrm{traj}}$ preserves the trajectory\-level relative preference because the update signal still depends on the centered trajectory reward $\mathcal{R}(\tau_{i})-S$\. Therefore, it reinforces successful trajectories at the trajectory level\. On the other hand, $\mathcal{J}_{\mathrm{step}}$ captures temporal discounting effects only on successful trajectories: since $\mathcal{R}(\tau_{i})=0$ for failed trajectories, their contribution to this term vanishes\. As a result, this term refines credit assignment within successful trajectories by emphasizing decisions that are more directly related to the final success\.

Therefore, although the samples in each group are conditioned on different local states rather than an identical prompt context, the objective remains effective: it still compares trajectories through their relative rewards while further refining the update signal at the step level\.
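The split of the centered objective into a trajectory\-level term and a step\-level term rests on a per\-sample algebraic identity, which can be verified numerically. The sketch below is an illustration under the analysis's own simplifications: all trajectories share one length $H$, clipping and $\sigma$ are dropped, and a single rollout group stands in for the expectation. The function names and the randomly drawn ratios are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def surrogate_terms(
    ratios: Sequence[Sequence[float]],
    outcomes: Sequence[float],
    gamma: float,
) -> Tuple[float, float, float, float]:
    """Return (J, J_traj, J_step, C_gamma) for one group with equal lengths.

    ratios[i][t-1] is the probability ratio rho_{i,t}; outcomes[i] is R(tau_i).
    """
    num_traj = len(ratios)
    horizon = len(ratios[0])
    c_gamma = (1.0 - gamma ** horizon) / (horizon * (1.0 - gamma))
    mean_return = sum(outcomes) / num_traj  # S

    j_full = j_traj = j_step = 0.0
    weight = 1.0 / (num_traj * horizon)  # (1 / G) * (1 / H)
    for rho_i, outcome in zip(ratios, outcomes):
        for t, rho in enumerate(rho_i, start=1):
            discount = gamma ** (horizon - t)
            j_full += weight * rho * (discount * outcome - c_gamma * mean_return)
            j_traj += weight * rho * (outcome - mean_return)
            j_step += weight * rho * (discount - c_gamma) * outcome
    return j_full, j_traj, j_step, c_gamma


def random_ratios(num_traj: int, horizon: int, seed: int = 0) -> List[List[float]]:
    """Draw hypothetical probability ratios near 1 for a quick check."""
    rng = random.Random(seed)
    return [[rng.uniform(0.8, 1.2) for _ in range(horizon)] for _ in range(num_traj)]
```

For any ratios, the returned values satisfy `j_full == c_gamma * j_traj + j_step` up to floating\-point error. When every ratio equals 1, both `j_traj` and `j_step` are zero, so the signal comes entirely from how the policy reweights samples relative to the old policy.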

## 4 Experiments

### 4\.1 Experimental Setup

**Environment\.** We conduct experiments in Minecraft \(Java Edition, v1\.16\.5\)\. In Minecraft, the agent’s observation space is strictly limited to first\-person view RGB images with a resolution of $360\times 640\times 3$, without access to any auxiliary state information\. The action space consists of discretized human\-like interface commands, including mouse movements, clicks, and keyboard inputs, to simulate authentic human gameplay\.

**Benchmarks and Task Categorization\.** To comprehensively assess the agent’s multimodal understanding and fine\-grained manipulation capabilities, we use the MCU benchmark \[[12](https://arxiv.org/html/2605.20246#bib.bib37)\], which includes over 800 tasks\. The benchmark comprises three task categories, each targeting specific skill sets: \(1\) Embodied tasks: The agent needs to navigate to the position of the target blocks and then use tools to mine or chop them\. This category focuses on 3D spatial awareness and navigation\. Agents must identify target textures within a complex voxel environment and use specific tools to excavate target blocks\. \(2\) GUI tasks: The agent is asked to create the target items with a furnace or crafting table\. This category tests precise 2D grid manipulation and logical reasoning to convert materials\. It further introduces temporal constraints, requiring agents to manage wait times during smelting processes\. \(3\) Combat tasks: The agent needs to kill the target entities\. This category focuses on adversarial strategies in highly dynamic environments\. Unlike static tasks, it requires the agent to track and defeat actively hostile moving targets\.

**Evaluation Metrics\.** We follow \[[22](https://arxiv.org/html/2605.20246#bib.bib35),[29](https://arxiv.org/html/2605.20246#bib.bib53)\] and employ the following primary metrics to assess the performance and efficiency of the agents across different task categories: \(1\) Steps: The average number of interaction steps required to finish a task\. \(2\) ASR: The overall average success rate over all tasks in that category, measuring the comprehensive task completion capability of the model\. Each task is evaluated over a minimum of 3 independent episodes to ensure reliable performance estimates\. To reflect performance variance, ASR is reported as the mean value alongside its standard deviation\.
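One plausible way to compute the ASR figure is sketched below. The text does not state whether the standard deviation is taken across tasks or across episodes, so this sketch makes an explicit assumption: it first averages the binary outcomes of each task over its episodes, then reports the mean and population standard deviation of those per\-task rates. The function name and the task names in the usage example are hypothetical.

```python
import statistics
from typing import Dict, Sequence, Tuple


def category_asr(episode_outcomes: Dict[str, Sequence[int]]) -> Tuple[float, float]:
    """Mean and standard deviation (in percent) of per-task success rates.

    episode_outcomes maps a task name to its binary episode outcomes
    (1 = verified success, 0 = failure). Assumption: the spread is
    measured across tasks, using the population standard deviation.
    """
    per_task = [
        100.0 * sum(outcomes) / len(outcomes)
        for outcomes in episode_outcomes.values()
    ]
    return statistics.mean(per_task), statistics.pstdev(per_task)


if __name__ == "__main__":
    # Hypothetical outcomes for two tasks, three episodes each.
    mean_asr, std_asr = category_asr({
        "task_a": [1, 1, 1],
        "task_b": [0, 0, 0],
    })
    print(f"ASR: {mean_asr:.1f} ± {std_asr:.1f}")
```

With only a few episodes per task, per\-task rates take coarse values such as 0%, 33%, 67%, or 100%, which is one reason a large standard deviation can accompany a moderate mean.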

### 4\.2 Implementation Details

Given that most VLM agents in Table [1](https://arxiv.org/html/2605.20246#S4.T1) are built upon Qwen2\-VL\-7B\-Instruct \[[21](https://arxiv.org/html/2605.20246#bib.bib61)\], we also adopt it as the base model for our agent to ensure a fair comparison\. We use 3M state\-action samples to initialize the VLM agent with the LLaMA\-Factory framework \[[28](https://arxiv.org/html/2605.20246#bib.bib41)\] on 8 H200 GPUs for about 3 days\. We select 8 tasks for RL training to cover all three task categories: 2 embodied tasks \(mine block diorite, mine birch log\), 4 GUI tasks \(craft furnace, craft iron pickaxe, smelt item cooked porkchop, smelt gold ingot\) and 2 combat tasks \(kill skeleton, kill blaze\)\. Task success is automatically determined by a built\-in environment verifier, which monitors game state transitions such as inventory changes and entity defeat events to produce binary rewards without human intervention\. We train our model in entirely different Minecraft worlds by selecting world seeds different from those used in evaluation\. The RL pipeline is implemented on top of the open\-source verl framework \[[18](https://arxiv.org/html/2605.20246#bib.bib51)\] to facilitate efficient distributed optimization\. We conduct all training on a single compute node equipped with 8 H200 GPUs\. The model is optimized for a total of 240 iterations over about 5 days with a discount factor of $\gamma=0.995$ to ensure stable and effective convergence across all evaluated tasks\. More details about the dataset for policy initialization and hyperparameters are in Appendix [C](https://arxiv.org/html/2605.20246#A3) and [D](https://arxiv.org/html/2605.20246#A4)\.
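To give a feel for what a discount factor of 0\.995 implies under Equation 1, the short sketch below computes how many steps before the end of a successful trajectory the per\-step reward falls to half, and the reward assigned to the very first step of a trajectory of a given length\. This is an illustrative calculation only; the helper names are hypothetical and the horizon in the usage example is an arbitrary choice\.

```python
import math


def discount_half_life(gamma: float) -> float:
    """Number of steps n before the end at which gamma ** n equals 0.5."""
    return math.log(0.5) / math.log(gamma)


def first_step_reward(gamma: float, horizon: int, outcome: float = 1.0) -> float:
    """Reward r_{i,1} of Eq. 1 for the first step of a trajectory of length H."""
    return gamma ** (horizon - 1) * outcome


if __name__ == "__main__":
    gamma = 0.995  # discount factor reported in the text
    horizon = 128  # arbitrary illustrative trajectory length
    print(f"half-life in steps: {discount_half_life(gamma):.1f}")
    print(f"first-step reward:  {first_step_reward(gamma, horizon):.3f}")
```

With this setting the half\-life is between 138 and 139 steps, so early steps of a successful trajectory of a few hundred steps still receive a clearly nonzero reward, while the final steps are weighted most strongly\.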

### 4\.3 Main Results

Table 1: Evaluation results of Minecraft agents on over 800 tasks\. ASR \(in %\) is reported as the mean value alongside its standard deviation\. The best performance is marked in bold\.

| Model | Size | Embodied Steps | Embodied ASR \(All\) | GUI Steps | GUI ASR \(All\) | Combat Steps | Combat ASR \(All\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VPT \[[3](https://arxiv.org/html/2605.20246#bib.bib6)\] | 248M | 377 | 6\.0±11\.4 | 398 | 0\.8±3\.3 | 396 | 3\.6±7\.7 |
| STEVE\-1 \[[11](https://arxiv.org/html/2605.20246#bib.bib24)\] | 248M | 384 | 8\.0±17\.0 | 391 | 3\.2±8\.4 | 395 | 3\.9±12\.0 |
| ROCKET\-1 \[[6](https://arxiv.org/html/2605.20246#bib.bib52)\] | 72B | 392 | 18\.9±24\.3 | N/A | 0\.0 | 320 | 27\.9±29\.3 |
| JARVIS\-VLA \[[9](https://arxiv.org/html/2605.20246#bib.bib7)\] | 7B | 305 | 30\.0±35\.4 | 339 | 25\.1±23\.9 | 352 | 18\.5±22\.7 |
| UI\-TARS\-1\.5 \[[19](https://arxiv.org/html/2605.20246#bib.bib65)\] | 7B | 290 | 42\.1±20\.4 | 320 | 36\.7±17\.2 | 346 | 31\.0±16\.4 |
| OpenHA \[[22](https://arxiv.org/html/2605.20246#bib.bib35)\] | 7B | 287 | 30\.1±13\.9 | 314 | 32\.5±9\.2 | 316 | 31\.9±13\.7 |
| Game\-TARS \[[23](https://arxiv.org/html/2605.20246#bib.bib43)\] | 7B | 373 | 50\.4±20\.7 | 406 | 39\.1±27\.5 | 372 | 38\.1±4\.6 |
| MAIN\-VLA \[[29](https://arxiv.org/html/2605.20246#bib.bib53)\] | 7B | 263 | 32\.8±15\.4 | 291 | 34\.4±14\.4 | 248 | 39\.2±16\.2 |
| Ours | 7B | **128** | **59\.6±37\.2** | **248** | **68\.4±29\.0** | **172** | **49\.0±35\.7** |

Table[1](https://arxiv.org/html/2605.20246#S4.T1)shows that GROW achieves the best performance across all three Minecraft task categories\. In embodied tasks, our agent improves the previous best ASR from 50\.4% to 59\.6%, with a gain of 9\.2 percentage points\. This improvement suggests that GROW strengthens the agent’s ability to navigate toward target objects after detection and to persistently use tools until the task is completed\. In GUI tasks, our RL framework yields the largest gain, improving ASR from 39\.1% to 68\.4%\. This gain mainly comes from improved recognition of target items and better mastery of the key interaction procedures in crafting and smelting\. In combat tasks, our agent improves ASR from 39\.2% to 49\.0%, indicating stronger decision\-making in highly dynamic environments where the agent must continuously track enemies, adjust its position, and select appropriate actions under changing visual states\.

Besides improving success rates, GROW also substantially reduces the number of steps required to complete tasks\. Compared with the previous best step counts, our agent reduces the average steps from 263 to 128 on embodied tasks, from 291 to 248 on GUI tasks, and from 248 to 172 on combat tasks\. These results show that GROW not only teaches the agent how to accomplish diverse Minecraft tasks, but also improves execution efficiency, leading to more direct and effective task completion\.
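The gains quoted in the two paragraphs above follow directly from Table 1\. The sketch below recomputes them, using the previous\-best values named in the text; the dictionary layout and function names are hypothetical, and the relative step reduction is an additional derived quantity not stated in the text\.

```python
from typing import Dict, Tuple

# (previous best, GROW) pairs taken from Table 1 and the accompanying text.
ASR: Dict[str, Tuple[float, float]] = {
    "embodied": (50.4, 59.6),
    "gui": (39.1, 68.4),
    "combat": (39.2, 49.0),
}
STEPS: Dict[str, Tuple[int, int]] = {
    "embodied": (263, 128),
    "gui": (291, 248),
    "combat": (248, 172),
}


def asr_gain_points(previous: float, ours: float) -> float:
    """Absolute ASR gain in percentage points, rounded to one decimal."""
    return round(ours - previous, 1)


def step_reduction_percent(previous: int, ours: int) -> float:
    """Relative reduction in average steps, in percent, rounded to one decimal."""
    return round(100.0 * (previous - ours) / previous, 1)


if __name__ == "__main__":
    for category in ASR:
        gain = asr_gain_points(*ASR[category])
        cut = step_reduction_percent(*STEPS[category])
        print(f"{category}: +{gain} ASR points, {cut}% fewer steps")
```

Note that the previous\-best ASR and the previous\-best step count in a category can come from different baselines, so each comparison is against the strongest prior result on that metric rather than against a single model\.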

Table 2: Success rates \(in %\) before and after RL on in\-domain and out\-of\-domain Minecraft tasks\. Success rates are reported as the mean value alongside their standard deviation\.

| Model | RL\-Trained Embodied Tasks | RL\-Trained GUI Tasks | RL\-Trained Combat Tasks | RL\-Unseen Embodied Tasks | RL\-Unseen GUI Tasks | RL\-Unseen Combat Tasks |
| --- | --- | --- | --- | --- | --- | --- |
| Initialized Policy | 35\.0±7\.1 | 10\.0±8\.2 | 45\.0±21\.2 | 41\.9±34\.7 | 21\.3±24\.8 | 19\.8±24\.4 |
| Ours | 90\.0±14\.1 | 92\.5±5\.0 | 85\.0±21\.2 | 59\.2±37\.4 | 68\.2±29\.1 | 47\.5±35\.9 |

### 4\.4 Generalization to Unseen Tasks via GROW

As shown in Table[2](https://arxiv.org/html/2605.20246#S4.T2), GROW substantially improves the initialized policy on both RL\-trained and RL\-unseen tasks\. On RL\-trained tasks, the success rate increases from 35\.0% to 90\.0% on embodied tasks, from 10\.0% to 92\.5% on GUI tasks, and from 45\.0% to 85\.0% on combat tasks\. More importantly, these gains are not limited to tasks used during RL, suggesting that the learned policy refinement is not merely task\-specific adaptation\. On RL\-unseen tasks, GROW also improves the success rate from 41\.9% to 59\.2% on embodied tasks, from 21\.3% to 68\.2% on GUI tasks, and from 19\.8% to 47\.5% on combat tasks\. Since these tasks are excluded from RL training, the consistent improvements provide direct evidence that GROW strengthens reusable interaction skills rather than memorizing training trajectories\. These results indicate that the agent trained with GROW does not overfit to the RL\-trained task set\. Instead, it acquires transferable capabilities that generalize to unseen Minecraft tasks across different categories\.

### 4\.5 Ablation Study

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/combined_baseline_success_rates.png)

Figure 3: Performance comparison between the PPO baseline and GROW, our proposed RL framework, across eight training tasks\. The learning curves show that GROW outperforms PPO by achieving significantly higher success rates and faster convergence in most tasks, highlighting the effectiveness of GROW as an RL framework\.

Comparison with PPO\. Proximal Policy Optimization \(PPO\) is one of the most widely used RL algorithms and was extensively adopted before the introduction of GRPO\. To examine whether GROW provides stronger policy optimization than this classical RL baseline, we train the initialized policy with PPO and with GROW under the same training budget and compare their performance in Table [3](https://arxiv.org/html/2605.20246#S4.T3)\. The results show that GROW improves success rates more effectively on RL\-trained tasks and also generalizes better to RL\-unseen tasks than PPO\.

To further analyze the training dynamics, Figure [3](https://arxiv.org/html/2605.20246#S4.F3) reports the success rate curves of GROW and PPO over the tasks used for RL training\. On most tasks, GROW reaches higher success rates earlier in training and follows a smoother upward trajectory, while PPO often improves more slowly or stays on a low\-success plateau for many training steps\. The advantage is most evident on tasks with delayed procedural rewards and multi\-step interaction requirements, where GROW continues to yield measurable gains as training progresses\. On relatively easy tasks, both methods eventually approach high success rates, but GROW generally reaches this region earlier\.

Table 3: Ablation study on RL algorithms and the discount factor $\gamma$\. All experiments are trained on the same task set for the same number of training steps to ensure a fair comparison\.

| Method | $\gamma$ | Embodied Tasks: Steps | Embodied Tasks: ASR \(All\) | GUI Tasks: Steps | GUI Tasks: ASR \(All\) | Combat Tasks: Steps | Combat Tasks: ASR \(All\) |
|---|---|---|---|---|---|---|---|
| PPO | 0\.995 | 177 | 58\.2±30\.3 | 580 | 43\.9±24\.3 | 181 | 45\.4±32\.2 |
| GROW | 0\.9 | 142 | 42\.0±35\.2 | 292 | 34\.9±34\.4 | 170 | 45\.1±37\.8 |
| GROW | 0\.95 | 122 | 59\.0±25\.9 | 309 | 62\.0±20\.6 | 167 | 46\.6±34\.5 |
| GROW | 0\.995 | 128 | 59\.6±37\.2 | 248 | 68\.4±29\.0 | 172 | 49\.0±35\.7 |

Ablation on the Discount Factor\. Our RL training framework includes a hyperparameter $\gamma$, which controls how strongly the task\-success signal decays when it is propagated backward across interaction steps\. Table [3](https://arxiv.org/html/2605.20246#S4.T3) evaluates how different values of $\gamma$ affect performance, thereby revealing the trade\-off between local credit assignment and long\-range task completion\. For tasks that require relatively few steps, moderately reducing $\gamma$ decreases the value assigned to early steps in overly long trajectories, which suppresses weakly relevant actions and encourages the model to learn more efficient execution strategies\. In contrast, for tasks that require many interaction steps, an excessively small $\gamma$ makes the model too short\-sighted by weakening useful credit signals from early but necessary actions, and therefore hurts performance\.
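
The effect of the discount factor can be illustrated with a minimal sketch. It assumes a binary terminal success signal that is propagated backward so that step t of a T-step trajectory receives gamma^(T-1-t), followed by a GRPO-style mean/std normalization within a group of state-action samples. The function names, the exact reward form, and the normalization are illustrative assumptions, not the paper's exact formulation.

```python
import math


def discounted_step_rewards(num_steps: int, success: bool, gamma: float) -> list[float]:
    """Assumed reward form: step t receives gamma**(T-1-t) times a binary terminal signal."""
    terminal = 1.0 if success else 0.0
    return [terminal * gamma ** (num_steps - 1 - t) for t in range(num_steps)]


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: a short success, a long success, and a failure.
gamma = 0.995
short_run = discounted_step_rewards(100, True, gamma)
long_run = discounted_step_rewards(400, True, gamma)
failed_run = discounted_step_rewards(100, False, gamma)

# Compare the first state-action sample of each rollout within one group.
first_step_rewards = [short_run[0], long_run[0], failed_run[0]]
advantages = group_relative_advantages(first_step_rewards)
```

Under these assumptions, the first step of the shorter successful rollout receives a larger propagated reward than the first step of the longer one, which is the mechanism by which a smaller $\gamma$ favors more efficient execution and, when too small, starves early but necessary actions of credit.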

### 4\.6 Behavior\-Level Analysis

We conduct a behavior\-level analysis on three diagnostic tasks in Minecraft to examine how GROW changes the learned policy\. Each task isolates a specific interaction skill and tests whether the state\-action formulation of GROW can convert sparse task\-level rewards into local policy improvements\.

Mine Obsidian with Iron Pickaxe\. This task demonstrates stable target fixation\. Breaking obsidian with an iron pickaxe requires sustained crosshair alignment over a long mining process, rather than a single correct action\. The baselines cannot complete the task, while GROW reaches 63\.3% success\. This suggests that GROW improves precise control across consecutive interaction steps\. By decomposing the mining trajectory into state\-action samples, GROW reinforces local actions that maintain effective interaction until the block is destroyed\.

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/diagnostic_task_success_rate.png)

Figure 4: Success rates on diagnostic tasks\. GROW improves stable target fixation, active target reacquisition, and distractor\-robust GUI operation\.

Kill Witch Out of Sight\. This task demonstrates active target reacquisition\. Since the witch starts outside the field of view, the agent must search, recover visual contact, and continue pursuit as the target moves away\. GROW achieves 50\.0% success, whereas both baselines fail completely\. This indicates that GROW strengthens not only attack behavior but also camera reorientation, visual tracking, approach, and sustained engagement\. We attribute this improvement to the reward assignment induced by trajectory decomposition: trajectories that contain active target\-searching behaviors are more likely to achieve final success, so the corresponding action segments receive higher propagated rewards and are repeatedly reinforced during training\. As a result, GROW helps the policy acquire the skill of actively searching for and reacquiring targets when visual contact is lost\.

Craft Cake with Distractions\. This task demonstrates distractor\-robust GUI operation\. The recipe book and inventory contain many irrelevant items, requiring the agent to select the correct cake recipe under visual clutter\. GROW reaches 50\.0% success, compared with 40\.0% for JARVIS\-VLA and 2\.0% for the initialized policy\. Although JARVIS\-VLA captures part of the recipe\-book interaction pattern, GROW improves precise recipe selection under distraction\. We attribute this improvement to the discounted trajectory reward used in GROW, which encourages the policy to discover more efficient execution paths during training\. This signal makes the agent focus more consistently on interface elements that are most relevant to task completion, even when many distractors are present, thereby improving its robustness to visual clutter and increasing the final success rate\.

Overall, these diagnostic tasks show that GROW improves stable target fixation, active target reacquisition, and distractor\-robust GUI operation\. The gains come from optimizing meaningful local decisions within long multi\-turn trajectories\.

## 5 Limitations

Our current agent mainly performs atomic tasks and lacks a memory module\. For long\-horizon tasks that require memory, future work will explore summarizing the interaction history after every few steps\. Additionally, because high\-fidelity open\-world simulators are often proprietary, extending our RL framework to broader training environments is constrained\. Expanding to diverse accessible environments may further increase the agent’s skill repertoire and general applicability\.

## 6 Conclusion

We propose GROW, an RL framework for open\-world VLM agents\. GROW decomposes trajectories into state\-action samples and computes relative advantages between these samples\. By avoiding full\-trajectory samples, this design mitigates excessive context accumulation while preserving the relative policy improvement principle of GRPO\. We also provide a surrogate analysis showing that the proposed formulation remains effective when samples in the same optimization group are conditioned on different short\-horizon states\. Experiments on more than 800 Minecraft tasks show that GROW achieves SOTA performance, demonstrating the effectiveness of our RL framework for open\-world VLM agents\. Across embodied, GUI, and combat tasks, GROW improves both success rate and execution efficiency, while also generalizing to unseen tasks and fostering reusable interaction skills\.

## References

- \[1\] Q\. Ai, P\. Bu, Y\. Cao, Y\. Wang, J\. Gu, J\. Xing, Z\. Zhu, W\. Jiang, Z\. Zheng, J\. Song, Y\. Jiang, and B\. Zheng \(2025\-08\) InquireMobile: teaching vlm\-based mobile agent to request human assistance via reinforcement fine\-tuning\. CoRR abs/2508\.19679\. [Link](https://doi.org/10.48550/arXiv.2508.19679)
- \[2\] S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang, H\. Zhong, Y\. Zhu, M\. Yang, Z\. Li, J\. Wan, P\. Wang, W\. Ding, Z\. Fu, Y\. Xu, J\. Ye, X\. Zhang, T\. Xie, Z\. Cheng, H\. Zhang, Z\. Yang, H\. Xu, and J\. Lin \(2025\) Qwen2\.5\-vl technical report\. arXiv:2502\.13923\. [Link](https://arxiv.org/abs/2502.13923)
- \[3\] B\. Baker, I\. Akkaya, P\. Zhokov, J\. Huizinga, J\. Tang, A\. Ecoffet, B\. Houghton, R\. Sampedro, and J\. Clune \(2022\) Video pretraining \(vpt\): learning to act by watching unlabeled online videos\. Advances in Neural Information Processing Systems 35, pp\. 24639–24654\.
- \[4\] S\. Cai, Z\. Mu, H\. Xia, B\. Zhang, A\. Liu, and Y\. Liang \(2025\) Scalable multi\-task reinforcement learning for generalizable spatial intelligence in visuomotor agents\. arXiv preprint arXiv:2507\.23698\.
- \[5\] S\. Cai, Z\. Wang, K\. Lian, Z\. Mu, X\. Ma, A\. Liu, and Y\. Liang \(2025\) ROCKET\-1: mastering open\-world interaction with visual\-temporal context prompting\. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\), pp\. 12122–12131\. [Document](https://dx.doi.org/10.1109/CVPR52734.2025.01132)
- \[6\] S\. Cai, Z\. Wang, K\. Lian, Z\. Mu, X\. Ma, A\. Liu, and Y\. Liang \(2025\) Rocket\-1: mastering open\-world interaction with visual\-temporal context prompting\. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp\. 12122–12131\.
- \[7\] Y\. Hu, A\. Xi, Q\. Xiao, S\. Isaacson, H\. X\. Liu, R\. Vasudevan, and M\. Ghaffari \(2026\) LongNav\-r1: horizon\-adaptive multi\-turn rl for long\-horizon vla navigation\. arXiv:2602\.12351\. [Link](https://arxiv.org/abs/2602.12351)
- \[8\] L\. Li, J\. Zhao, Y\. Xie, X\. Tan, and X\. Li \(2026\) CompassNav: steering from path imitation to decision understanding in navigation\. In The Fourteenth International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=eqcDckWHik)
- \[9\] M\. Li, Z\. Wang, K\. He, X\. Ma, and Y\. Liang \(2025\-07\) JARVIS\-VLA: post\-training large\-scale vision language models to play visual games with keyboards and mouse\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 17878–17899\. ISBN 979\-8\-89176\-256\-5\. [Link](https://aclanthology.org/2025.findings-acl.920/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.920)
- \[10\] N\. Li, Q\. Lin, Z\. Wu, X\. Mo, W\. Zhang, Y\. Zhao, X\. Qu, J\. Zhou, J\. Wang, C\. Zheng, Y\. Song, H\. Chen, H\. Huang, J\. Wang, J\. Yin, J\. Yu, J\. Liao, Q\. Peng, X\. Lou, J\. Wang, W\. Liu, Z\. Zhang, and W\. Zhang \(2025\) ColorAgent: building a robust, personalized, and interactive os agent\. arXiv:2510\.19386\. [Link](https://arxiv.org/abs/2510.19386)
- \[11\] S\. Lifshitz, K\. Paster, H\. Chan, J\. Ba, and S\. McIlraith \(2023\) Steve\-1: a generative model for text\-to\-behavior in minecraft\. Advances in Neural Information Processing Systems 36, pp\. 69900–69929\.
- \[12\] H\. Lin, Z\. Wang, J\. Ma, and Y\. Liang \(2023\) Mcu: a task\-centric framework for open\-ended agent evaluation in minecraft\. arXiv preprint arXiv:2310\.08367\.
- \[13\] Z\. Luo, W\. Yan, J\. Gong, M\. Wang, Z\. Zhang, X\. Wang, Y\. Xie, and X\. Tan \(2025\) Navimaster: learning a unified policy for gui and embodied navigation tasks\. arXiv preprint arXiv:2508\.02046\.
- \[14\] C\. Lynch, A\. Wahid, J\. Tompson, T\. Ding, J\. Betker, R\. Baruch, T\. Armstrong, and P\. Florence \(2022\) Interactive language: talking to robots in real time\. arXiv:2210\.06407\. [Link](https://arxiv.org/abs/2210.06407)
- \[15\] L\. Magne, A\. Awadalla, G\. Wang, Y\. Xu, J\. Belofsky, F\. Hu, J\. Kim, L\. Schmidt, G\. Gkioxari, J\. Kautz, Y\. Yue, Y\. Choi, Y\. Zhu, and L\. Fan \(2026\) NitroGen: an open foundation model for generalist gaming agents\. arXiv:2601\.02427\. [Link](https://arxiv.org/abs/2601.02427)
- \[16\] M\. Ouyang, S\. Hu, K\. Q\. Lin, H\. T\. Ng, and M\. Z\. Shou \(2026\) GameWorld: towards standardized and verifiable evaluation of multimodal game agents\. [Link](https://api.semanticscholar.org/CorpusID:287255688)
- \[17\] Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu, et al\. \(2024\) Deepseekmath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\.
- \[18\] G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\) HybridFlow: a flexible and efficient rlhf framework\. arXiv preprint arXiv:2409\.19256\.
- \[19\] Y\. Shi, W\. Yu, Z\. Li, Y\. Wang, H\. Zhang, N\. Liu, H\. Mi, and D\. Yu \(2025\) MobileGUI\-rl: advancing mobile gui agent through reinforcement learning in online environment\. arXiv:2507\.05720\. [Link](https://arxiv.org/abs/2507.05720)
- \[20\] W\. Tan, X\. Li, Y\. Fang, H\. Yao, S\. Yan, H\. Luo, T\. Ao, H\. Li, H\. Ren, B\. Yi, Y\. Qin, B\. An, L\. Liu, and G\. Shi \(2025\) Lumine: an open recipe for building generalist agents in 3d open worlds\. arXiv:2511\.08892\. [Link](https://arxiv.org/abs/2511.08892)
- \[21\] P\. Wang, S\. Bai, S\. Tan, S\. Wang, Z\. Fan, J\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, Y\. Fan, K\. Dang, M\. Du, X\. Ren, R\. Men, D\. Liu, C\. Zhou, J\. Zhou, and J\. Lin \(2024\) Qwen2\-vl: enhancing vision\-language model’s perception of the world at any resolution\. arXiv:2409\.12191\. [Link](https://arxiv.org/abs/2409.12191)
- \[22\] Z\. Wang, M\. Li, K\. He, X\. Wang, Z\. Mu, A\. Liu, and Y\. Liang \(2025\) Openha: a series of open\-source hierarchical agentic models in minecraft\. arXiv preprint arXiv:2509\.13347\.
- \[23\] Z\. Wang, X\. Li, Y\. Ye, J\. Fang, H\. Wang, L\. Liu, S\. Liang, J\. Lu, Z\. Wu, J\. Feng, et al\. \(2025\) Game\-tars: pretrained foundation models for scalable generalist multimodal game agents\. arXiv preprint arXiv:2510\.23691\.
- \[24\] Z\. Xi, Y\. Ding, W\. Chen, B\. Hong, H\. Guo, J\. Wang, X\. Guo, D\. Yang, C\. Liao, W\. He, S\. Gao, L\. Chen, R\. Zheng, Y\. Zou, T\. Gui, Q\. Zhang, X\. Qiu, X\. Huang, Z\. Wu, and Y\. Jiang \(2025\) AgentGym: evaluating and training large language model\-based agents across diverse environments\. In Annual Meeting of the Association for Computational Linguistics\. [Link](https://api.semanticscholar.org/CorpusID:280017701)
- \[25\] S\. Ye, S\. Mao, Y\. Cui, X\. Yu, S\. Zhai, W\. Chen, S\. Zhou, R\. Xiong, and Y\. Wang \(2025\) ETP\-r1: evolving topological planning with reinforcement fine\-tuning for vision\-language navigation in continuous environments\. arXiv:2512\.20940\. [Link](https://arxiv.org/abs/2512.20940)
- \[26\] H\. Zhang, X\. Liu, B\. Lv, X\. Sun, B\. Jing, I\. L\. Iong, Z\. Hou, Z\. Qi, H\. Lai, Y\. Xu, et al\. \(2025\) AgentRL: scaling agentic reinforcement learning with a multi\-turn, multi\-task framework\. arXiv preprint arXiv:2510\.04206\.
- \[27\] Z\. Zhang, W\. Zhu, H\. Pan, X\. Wang, R\. Xu, X\. Sun, and F\. Zheng \(2025\) ActiveVLN: towards active exploration via multi\-turn rl in vision\-and\-language navigation\. arXiv:2509\.12618\. [Link](https://arxiv.org/abs/2509.12618)
- \[28\] Y\. Zheng, R\. Zhang, J\. Zhang, Y\. Ye, Z\. Luo, Z\. Feng, and Y\. Ma \(2024\) LlamaFactory: unified efficient fine\-tuning of 100\+ language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\), Bangkok, Thailand\. [Link](http://arxiv.org/abs/2403.13372)
- \[29\] Z\. Zhou, L\. Du, Z\. Sun, X\. Zhou, R\. Ye, Q\. Chen, Y\. Chen, and L\. Qiu \(2026\) MAIN\-vla: modeling abstraction of intention and environment for vision\-language\-action models\. arXiv preprint arXiv:2602\.02212\.

## Appendix A Environment and Action Space

Observation Space\. Our model operates under a strict pixel\-only constraint, perceiving the environment solely through raw $640\times 360$ RGB frames without access to any internal game states, coordinates, or metadata\. Following the protocol established in JARVIS\-VLA \[[9](https://arxiv.org/html/2605.20246#bib.bib7)\], these observations are intended to mirror the authentic experience of a human player; as such, we do not remove or mask any standard on\-screen overlays\. The visual input includes all native HUD elements \(such as the hotbar, health and hunger indicators, and the dynamic hand\-swing animations triggered by interactions\), forcing the agent to interpret the environment and its own status purely from raw sensory data\. To maintain this human\-centric perspective, we employ a standard $70^{\circ}$ field of view and a GUI scale of 2, ensuring the visual distribution remains consistent with typical gameplay\.

Action Space\. The action space is designed to be both atomic and expressive, mapping directly to the fundamental keyboard and mouse inputs available to a human\. Rather than employing high\-level macro\-actions or simplified APIs, we decompose player behavior into granular primitives\. This includes binary operations for movement and interaction \(e\.g\., sprinting, jumping, and attacking\) alongside continuous controls for camera orientation \(pitch and yaw\)\. By relying on these basic building blocks, the agent is required to learn the composition of complex, multi\-step strategies from the ground up\. The full set of these operations is summarized in Table [4](https://arxiv.org/html/2605.20246#A1.T4)\. At inference time, the agent predicts an action chunk covering 4 interaction steps from its current observation\.

Table 4: Mapping binary primitives to the standard Minecraft controls\.

| Action | Human action | Description |
|---|---|---|
| forward | W key | Move forward\. |
| back | S key | Move backward\. |
| left | A key | Strafe left\. |
| right | D key | Strafe right\. |
| jump | Space key | Jump\. |
| sneak | left Shift key | Switch to a slow walking mode\. |
| sprint | left Ctrl key | Switch to rapid walking mode\. |
| attack | left Button | Destroy blocks \(hold down\); Attack entity \(click once\)\. |
| use | right Button | Place the item currently held or use the block the player is looking at\. |
| drop | Q key | Drop a single item from the stack of items the player is currently holding\. |
| hotbar\.\[1\-9\] | keys 1\-9 | Switch active item to the one in a given hotbar cell\. |
| inventory | E key | Open/Close the inventory\. |
| yaw | move Mouse X | Camera movement\. |
| pitch | move Mouse Y | Camera movement\. |
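
One way to picture this action space is as a small record per interaction step, grouped into chunks of four. The sketch below is only an illustration: the `Action` class, `BINARY_KEYS`, and `make_chunk` are hypothetical names, and the actual token-level encoding used by the model is not shown here.

```python
from dataclasses import dataclass

# Binary primitives listed in Table 4 (hotbar.[1-9] expanded to nine keys).
BINARY_KEYS = (
    "forward", "back", "left", "right", "jump", "sneak", "sprint",
    "attack", "use", "drop", "inventory",
) + tuple(f"hotbar.{i}" for i in range(1, 10))

CHUNK_SIZE = 4  # interaction steps predicted per observation


@dataclass(frozen=True)
class Action:
    """One interaction step: pressed binary keys plus continuous camera deltas."""
    pressed: frozenset = frozenset()
    yaw: float = 0.0    # horizontal mouse movement
    pitch: float = 0.0  # vertical mouse movement

    def __post_init__(self):
        unknown = set(self.pressed) - set(BINARY_KEYS)
        if unknown:
            raise ValueError(f"unknown primitives: {sorted(unknown)}")


def make_chunk(actions):
    """Check that a predicted chunk contains exactly CHUNK_SIZE steps."""
    actions = tuple(actions)
    if len(actions) != CHUNK_SIZE:
        raise ValueError(f"expected {CHUNK_SIZE} actions, got {len(actions)}")
    return actions


# Hypothetical chunk: sprint forward while turning right, then attack twice.
chunk = make_chunk([
    Action(frozenset({"forward", "sprint"}), yaw=5.0),
    Action(frozenset({"forward", "sprint"}), yaw=5.0),
    Action(frozenset({"attack"})),
    Action(frozenset({"attack"})),
])
```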
## Appendix B Statistical Evidence

Figure [5](https://arxiv.org/html/2605.20246#A2.F5) shows the step distribution across tasks\. We observe that trajectories for the same task have similar numbers of execution steps\. For failed tasks, the early termination stage also keeps trajectory lengths as close as possible to those of successful executions\.

![Refer to caption](https://arxiv.org/html/2605.20246v1/x1.png)

Figure 5: Step count histograms over ten rollouts for eight RL tasks\.
## Appendix C Dataset Construction

In our training framework, the cold\-start stage is the only phase necessitating external data collection\. Conversely, the next stage, multi\-turn RL with GRPO, relies solely on predefined scenarios and resource configurations for each task, eliminating the need for additional data acquisition\.

We first adopted the approach from JARVIS\-VLA \[[9](https://arxiv.org/html/2605.20246#bib.bib7)\] by utilizing a dataset rich in world knowledge for supervised fine\-tuning\. This component comprises 25k visual question answering samples, 225k grounding samples, and 9k captioning samples\. Subsequently, we performed detailed processing on the open\-source human player trajectory dataset from VPT \[[3](https://arxiv.org/html/2605.20246#bib.bib6)\] to construct an instruction\-annotated action dataset\.

Labeling and Segmentation\. We identified task completion based on transitions in the player’s state recorded within the dataset\. For instance, if the player’s inventory showed the acquisition of an iron ore relative to the previous state, we labeled the preceding trajectory segment with the instruction to "mine iron ore\." We defined the segment duration as the interval starting from the completion of the previous task to the moment the current state change occurred\. This logic applies similarly to other interactions, such as labeling the interval preceding the defeat of a creature or the acquisition of a specific item with the corresponding objective\.
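
The segmentation rule can be sketched as a scan over per-frame inventory snapshots. This is a simplified illustration that only handles item acquisition; the snapshot format, the `label_segments` name, and the instruction template are assumptions and not the dataset's actual schema.

```python
def label_segments(inventories: list[dict[str, int]]) -> list[tuple[int, int, str]]:
    """Split a trajectory at frames where an item count increases.

    Returns (start_frame, end_frame, instruction) tuples, where each segment
    runs from the end of the previous task to the frame of the state change.
    """
    segments = []
    start = 0
    for t in range(1, len(inventories)):
        prev, cur = inventories[t - 1], inventories[t]
        gained = [item for item, count in cur.items() if count > prev.get(item, 0)]
        if gained:
            # Hypothetical instruction template; real labels would use task-specific verbs.
            segments.append((start, t, f"obtain {gained[0]}"))
            start = t + 1
    return segments


# Hypothetical five-frame trajectory: iron ore appears at frame 2, cobblestone at frame 4.
snapshots = [
    {},
    {},
    {"iron_ore": 1},
    {"iron_ore": 1},
    {"iron_ore": 1, "cobblestone": 1},
]
segments = label_segments(snapshots)
```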

Cleaning and Chunking\. Following the data cleaning protocol in VPT \[[3](https://arxiv.org/html/2605.20246#bib.bib6)\], we filtered the dataset to remove redundant idle sequences\. Specifically, if more than three consecutive idle frames were detected, we discarded the redundant frames starting from the fourth frame\. To construct action chunks, we grouped actions into sets of four\. For each set, we retained the observation solely from the first frame and concatenated the actions chronologically\. We applied padding with null actions for groups containing fewer than four inputs\.
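
The two steps above can be sketched as follows, assuming each frame is an (observation, action) pair and that idle frames are marked by a placeholder null action. The `NULL_ACTION` marker and the function names are hypothetical.

```python
MAX_IDLE_RUN = 3      # keep at most three consecutive idle frames
CHUNK_SIZE = 4        # actions per chunk
NULL_ACTION = "noop"  # hypothetical marker for an idle / null action


def drop_redundant_idle(frames):
    """Discard idle frames from the fourth consecutive one onward."""
    kept = []
    idle_run = 0
    for obs, act in frames:
        idle_run = idle_run + 1 if act == NULL_ACTION else 0
        if idle_run <= MAX_IDLE_RUN:
            kept.append((obs, act))
    return kept


def chunk_actions(frames):
    """Group frames into chunks of four: first observation plus four actions."""
    chunks = []
    for start in range(0, len(frames), CHUNK_SIZE):
        group = frames[start:start + CHUNK_SIZE]
        actions = [act for _, act in group]
        # Pad the final group with null actions if it has fewer than four.
        actions += [NULL_ACTION] * (CHUNK_SIZE - len(actions))
        chunks.append((group[0][0], actions))
    return chunks


# Hypothetical trajectory with a run of five idle frames in the middle.
raw = [("o0", "w"), ("o1", "noop"), ("o2", "noop"), ("o3", "noop"),
       ("o4", "noop"), ("o5", "noop"), ("o6", "a")]
cleaned = drop_redundant_idle(raw)
chunks = chunk_actions(cleaned)
```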

Balancing and Final Composition\. An analysis of the dataset composition revealed a severe distributional imbalance, as the task of mining stone constituted 39\.45% of the total data\. This skew significantly compromised the performance of the initial SFT model and hindered the subsequent reinforcement fine\-tuning \(RFT\) process\. We addressed this issue by downsampling the overrepresented tasks to achieve a balanced distribution, which enabled the successful training of our supervised baseline\. Through these processing and balancing steps, we ultimately collected approximately 4 million state\-action samples for SFT\.
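
A simple way to downsample overrepresented tasks is to cap the number of samples per task. The sketch below assumes a per-task cap and a `task` field on each sample; the exact balancing scheme and cap used for the dataset are not specified here, so both are illustrative.

```python
import random
from collections import Counter, defaultdict


def downsample_by_task(samples, max_per_task, seed=0):
    """Randomly keep at most max_per_task samples for each task label."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for sample in samples:
        by_task[sample["task"]].append(sample)

    balanced = []
    for group in by_task.values():
        if len(group) > max_per_task:
            group = rng.sample(group, max_per_task)
        balanced.extend(group)
    return balanced


# Hypothetical toy dataset dominated by one task.
toy = (
    [{"task": "mine stone", "id": i} for i in range(100)]
    + [{"task": "mine iron ore", "id": i} for i in range(10)]
)
balanced = downsample_by_task(toy, max_per_task=20)
counts = Counter(sample["task"] for sample in balanced)
```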

## Appendix D Hyperparameters

Our training framework consists of three progressive stages: \(i\) World Knowledge Learning, \(ii\) Imitation Learning, and \(iii\) Multi\-Turn RL\. The specific training parameters for all three stages are detailed in Table[5](https://arxiv.org/html/2605.20246#A4.T5)\.

Table 5: Hyperparameter settings across different training stages\.

| Hyperparameter | World Knowledge | Imitation Learning | Multi\-Turn RL |
|---|---|---|---|
| Trainable Components | Full | Language Models | Full |
| LR Scheduler | Cosine | Cosine | \- |
| Warm\-up Ratio | 0\.1 | 0\.1 | \- |
| Global Batch Size | 32 | 32 | 8 |
| Optimizer | AdamW | AdamW | AdamW |
| Learning Rate | $1\times 10^{-5}$ | $1\times 10^{-5}$ | $1\times 10^{-6}$ |
| Group Size \($G$\) | \- | \- | 8 |
| Clipping Parameter \($\epsilon$\) | \- | \- | 0\.2 |
## Appendix E Extending GROW to Simulated Language Table

To demonstrate that GROW remains effective in a markedly different embodied manipulation setting and is not limited to Minecraft, we retrain GROW on the simulated Language Table\[[14](https://arxiv.org/html/2605.20246#bib.bib55)\], which provides a tabletop environment characterized by continuous control dynamics\.

### E\.1 Experimental Setup

The policy receives an egocentric RGB view from a tabletop manipulation robot as input\. Each state is represented by an image with resolution $180\times 320\times 3$\. The action space consists of two\-dimensional control values, corresponding to the displacement of the robot end effector along the $x$ and $y$ axes\. To keep the action interface consistent with the Minecraft setting, we use $\mu$\-law encoding to discretize the $x$ and $y$ axes independently into 21 bins, yielding 42 action tokens in total\. Each bin is mapped to a reserved token\.
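
A minimal sketch of this discretization is given below. The companding constant `MU`, the normalized displacement range `MAX_ABS`, and the layout of the 42 token ids (x bins first, then y bins) are assumptions made for illustration; the mapping from ids to reserved tokenizer tokens is tokenizer-specific and not shown.

```python
import math

MU = 255.0     # assumed companding constant
NUM_BINS = 21  # bins per axis
MAX_ABS = 1.0  # assumed maximum absolute displacement after normalization


def mu_law_encode(x: float) -> float:
    """Compress x in [-MAX_ABS, MAX_ABS] to [-1, 1] with mu-law companding."""
    x = max(-MAX_ABS, min(MAX_ABS, x)) / MAX_ABS
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def to_bin(x: float) -> int:
    """Map a displacement to one of NUM_BINS uniformly spaced companded bins."""
    y = mu_law_encode(x)
    return int(round((y + 1.0) / 2.0 * (NUM_BINS - 1)))


def from_bin(b: int) -> float:
    """Invert the companding to recover the bin's representative displacement."""
    y = b / (NUM_BINS - 1) * 2.0 - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return x * MAX_ABS


def action_to_token_ids(dx: float, dy: float) -> tuple[int, int]:
    """Return ids in [0, 42): x bins occupy 0..20 and y bins occupy 21..41."""
    return to_bin(dx), NUM_BINS + to_bin(dy)
```

Because the companded axis is binned uniformly, bins are denser near zero displacement, which gives finer resolution to small end-effector motions than uniform binning of the raw values would.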

For multi\-turn RL, we follow the task setup of the simulated Language Table \[[14](https://arxiv.org/html/2605.20246#bib.bib55)\] and train on five task families: block2block, block2abs, block2rel, block2blockrel, and separate\. The detailed definitions, success criteria, and the exact number of conditions for each task family are summarized in Table [6](https://arxiv.org/html/2605.20246#A5.T6)\.

Table 6: Definitions and configurations of the five task families in the simulated Language Table environment\.

| Task Family | Agent Action | Success Criterion | \# Conditions |
|---|---|---|---|
| block2block | Pushes a source block to another target block\. | Distance between the source and target blocks is below a threshold\. | 56 \(8 src × 7 tgt\) |
| block2abs | Pushes a block to an absolute board location \(9 locations¹\)\. | Distance between the block and the target location is below a threshold\. | 72 \(8 blk × 9 loc\) |
| block2rel | Pushes a block to a relative offset location \(8 directions²\)\. | Distance between the block and the invisible target offset location is below a threshold\. | 64 \(8 blk × 8 dir\) |
| block2blockrel | Pushes a source block to a relative offset location of another block \(8 directions²\)\. | Distance between the source block and the invisible target offset of the target block is below a threshold\. | 448 \(8 src × 7 tgt × 8 dir\) |
| separate | Separates two blocks\. | Distance between the two blocks exceeds a predefined threshold\. | 56 \(8 src × 7 tgt\) |

¹ 9 absolute locations: top left, top center, top right, center left, center, center right, bottom left, bottom center, and bottom right\.

² 8 relative directions: left, right, up, down, up\-left, up\-right, down\-left, and down\-right\.
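
The condition counts in Table 6 follow from enumerating ordered block pairs, locations, and directions. The short check below reproduces them; the block identifiers are placeholders, since only the number of blocks (8) matters.

```python
from itertools import permutations, product

blocks = [f"block_{i}" for i in range(8)]  # placeholder block identifiers
locations = [
    "top left", "top center", "top right",
    "center left", "center", "center right",
    "bottom left", "bottom center", "bottom right",
]
directions = [
    "left", "right", "up", "down",
    "up-left", "up-right", "down-left", "down-right",
]

# Ordered (source, target) pairs of distinct blocks: 8 * 7.
ordered_pairs = list(permutations(blocks, 2))

condition_counts = {
    "block2block": len(ordered_pairs),
    "block2abs": len(list(product(blocks, locations))),
    "block2rel": len(list(product(blocks, directions))),
    "block2blockrel": len(list(product(ordered_pairs, directions))),
    "separate": len(ordered_pairs),
}
```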

### E\.2 Implementation Details

First, we initialize the policy from Qwen2\.5\-VL\-7B\-Instruct \[[2](https://arxiv.org/html/2605.20246#bib.bib38)\] via SFT on 1M state\-action samples\. We construct the SFT dataset from the real\-robot Language Table dataset by decomposing each episode into individual time steps; for each time step, the corresponding visual state and annotated control signal form one state\-action sample\. We train the cold\-start policy on 4 H200 GPUs for 8000 steps \(about 3 days\) with a global batch size of 32\. We use AdamW as the optimizer with a cosine learning\-rate schedule and a warm\-up ratio of 0\.1\. During SFT, all model components are updated\.

The simulated Language Table environment provides built\-in task\-specific verifiers for these task families\. We use these verifiers to compute binary episode\-level rewards according to the corresponding thresholded geometric success conditions\. During RL, we train the policy with GROW for 200 training steps on 2 H200 GPUs \(about 2 days\)\. The main hyperparameters are summarized in Table [7](https://arxiv.org/html/2605.20246#A5.T7)\.
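
To make the reward definition concrete, the sketch below shows what a thresholded geometric success check and the resulting binary episode reward could look like. The environment's built-in verifiers are the actual source of rewards; the function names, the 2D position format, and the threshold value here are hypothetical.

```python
import math

Point = tuple[float, float]


def reached_target(block_xy: Point, target_xy: Point, threshold: float) -> bool:
    """Push-style success: the block ends within `threshold` of the target."""
    return math.dist(block_xy, target_xy) < threshold


def separated(block_a_xy: Point, block_b_xy: Point, threshold: float) -> bool:
    """Separate-style success: the two blocks end farther apart than `threshold`."""
    return math.dist(block_a_xy, block_b_xy) > threshold


def episode_reward(success: bool) -> float:
    """Binary episode-level reward."""
    return 1.0 if success else 0.0


# Hypothetical final positions and threshold (arbitrary units).
push_reward = episode_reward(reached_target((0.10, 0.20), (0.12, 0.21), threshold=0.05))
separate_reward = episode_reward(separated((0.10, 0.20), (0.12, 0.21), threshold=0.05))
```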

Table 7: Hyperparameter settings for RL training in simulated Language Table\.

| Hyperparameter | Value |
|---|---|
| Global Batch Size | 16 |
| Optimizer | AdamW |
| Learning Rate | $1\times 10^{-6}$ |
| Group Size \($G$\) | 8 |
| Discount Factor \($\gamma$\) | 0\.995 |
| Clipping Parameter \($\epsilon$\) | 0\.2 |
### E\.3 Results

Table [8](https://arxiv.org/html/2605.20246#A5.T8) shows three main trends\. First, GROW achieves the highest success rate across all five simulated Language Table task families, improving the average success rate from 65\.02 with PPO to 79\.41\. Second, the gains are especially clear on spatial reasoning tasks: GROW exceeds PPO by 21\.94 points on block2rel, by 18\.70 points on block2abs, and improves block2blockrel from 46\.91 to 59\.47, suggesting stronger grounding of visual states, spatial targets, and action decisions\. Third, the comparison with the initialized policy shows that RL is critical: the initialized policy only achieves 0\.02–0\.03 on the four block\-pushing tasks and 18\.8 on separate, while GROW reaches at least 59\.47 on all task families and 100\.00 on separate\. These results indicate that GROW substantially improves closed\-loop manipulation beyond supervised initialization and is effective in a continuous\-control embodied environment\.

Table 8: Success rates on the simulated Language Table tasks\.

| Method | block2block | block2abs | block2rel | block2blockrel | separate |
|---|---|---|---|---|---|
| Initialized Policy | 0\.02 | 0\.03 | 0\.02 | 0\.02 | 18\.8 |
| PPO | 53\.11 | 62\.53 | 65\.62 | 46\.91 | 96\.93 |
| Ours | 68\.81 | 81\.23 | 87\.56 | 59\.47 | 100\.00 |
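
The averages and per-task gains quoted in the text can be recomputed directly from the Table 8 entries. The snippet below is a plain arithmetic check, assuming the reported average is the unweighted mean over the five task families.

```python
ppo = {
    "block2block": 53.11,
    "block2abs": 62.53,
    "block2rel": 65.62,
    "block2blockrel": 46.91,
    "separate": 96.93,
}
ours = {
    "block2block": 68.81,
    "block2abs": 81.23,
    "block2rel": 87.56,
    "block2blockrel": 59.47,
    "separate": 100.00,
}

# Unweighted mean over the five task families.
avg_ppo = sum(ppo.values()) / len(ppo)
avg_ours = sum(ours.values()) / len(ours)

# Per-task improvement of GROW over PPO, in percentage points.
gains = {task: ours[task] - ppo[task] for task in ppo}
```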

## Appendix F Case Study in Minecraft

To visualize the agent’s decision\-making, we overlay its action space on the left of each case study frame\. We use a color\-coded telemetry system to distinguish intent: yellow denotes inactive control primitives, while red signifies the specific actions selected by the model at that time step\. This provides a direct, interpretable trace of the agent’s behavioral primitives—such as movement keys or camera adjustments—relative to the visual context\.

### F\.1 Case Study: Kill Guardian

Task Dynamics\. This case study highlights the model’s ability to handle complex adversarial dynamics in Minecraft\. The target, a Guardian, presents a unique challenge: it is a ranged attacker that maintains a tactical distance, retreating when the player approaches too closely while staying within its own offensive range\. Since the agent is equipped only with a sword \(a melee weapon\), it must master a sophisticated behavioral loop involving active search, tactical approach, and precise engagement under evasive conditions\.

Target Search \(Figure[6\(a\)](https://arxiv.org/html/2605.20246#A6.F6.sf1),[6\(b\)](https://arxiv.org/html/2605.20246#A6.F6.sf2)\): When the Guardian moves out of the field of view \(FOV\), the agent does not wander aimlessly\. Instead, it performs a systematic search, utilizing camera yaw and pitch to scan the environment until the target is re\-acquired\.

Evasion and Approach \(Figure[6\(c\)](https://arxiv.org/html/2605.20246#A6.F6.sf3),[6\(d\)](https://arxiv.org/html/2605.20246#A6.F6.sf4)\): The agent identifies the Guardian’s ranged beam attack and initiates a direct approach\. It successfully closes the gap despite the Guardian’s attempts to maintain distance\.

Melee Engagement \(Figure[6\(e\)](https://arxiv.org/html/2605.20246#A6.F6.sf5),[6\(f\)](https://arxiv.org/html/2605.20246#A6.F6.sf6)\): Once within striking distance, the agent executes precise attack primitives while simultaneously tracking the target’s movements, eventually neutralizing the threat through melee strikes\.

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/guardian_combat/target_search.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/guardian_combat/target_search_b.png)\(b\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/guardian_combat/evasion_approch.png)\(c\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/guardian_combat/evasion_approch_b.png)\(d\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/guardian_combat/melle_engagement.png)\(e\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/guardian_combat/melle_engagement_b.png)\(f\)

Figure 6: Behavioral analysis of the agent engaging a Guardian in combat.

### F.2 Case Study: Craft Stone Shovel

Synthesizing advanced tools in Minecraft requires the agent to transition from spatial navigation to abstract GUI manipulation\. In this case study, the agent must perform a multi\-stage sequence: locating and interacting with a Crafting Table, navigating a multi\-page recipe book to identify a specific target \(Stone Shovel\), and executing the final synthesis\. This task tests the model’s ability to maintain long\-term goal coherence across different visual modalities \(3D world vs\. 2D interface\)\.

Real-time Action Telemetry: As visualized in Figure [7](https://arxiv.org/html/2605.20246#A6.F7), we use a color-coded telemetry overlay to trace the agent’s intent: yellow denotes inactive controls, while red highlights active primitives. This allows us to observe the shift from movement-based exploration to precise UI-based clicking.

Environment Interaction (Figure [7(a)](https://arxiv.org/html/2605.20246#A6.F7.sf1), [7(b)](https://arxiv.org/html/2605.20246#A6.F7.sf2)): The agent approaches the Crafting Table and triggers the Use primitive (red) to open the synthesis interface.

GUI Navigation (Figure [7(c)](https://arxiv.org/html/2605.20246#A6.F7.sf3), [7(d)](https://arxiv.org/html/2605.20246#A6.F7.sf4), [7(e)](https://arxiv.org/html/2605.20246#A6.F7.sf5)): Once the interface is active, the agent’s focus shifts to the recipe book. The telemetry reveals a sequence of precise Attack (simulated as "click") and Camera (mouse cursor movement) primitives as the agent scrolls through multiple pages to locate the Stone Shovel recipe.

Target Synthesis (Figure [7(f)](https://arxiv.org/html/2605.20246#A6.F7.sf6), [7(g)](https://arxiv.org/html/2605.20246#A6.F7.sf7), [7(h)](https://arxiv.org/html/2605.20246#A6.F7.sf8)): Upon selecting the correct recipe, the agent executes the crafting command. The successful completion is confirmed by the appearance of the item in the output slot and the subsequent "Stone Age" advancement notification.

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_0.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_a.png)\(b\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_b.png)\(c\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_c.png)\(d\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_d.png)\(e\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_e.png)\(f\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_f.png)\(g\)
![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/appendix/usercase/craft_stone_shovel/craft_stone_shovel_g.png)\(h\)

Figure 7: Case study of the agent synthesizing a Stone Shovel.

## Appendix G Prompt in Experiments

Figure [8](https://arxiv.org/html/2605.20246#A7.F8) shows the prompt used in our experiments.

![Refer to caption](https://arxiv.org/html/2605.20246v1/assets/prompt.png)Figure 8:
