Beyond Next-Observation Prediction: Agent-Authored World Modeling for Sequential Decision Making

arXiv cs.CL 06/25/26, 04:00 AM Papers
llm-agents world-modeling sequential-decision-making next-observation-prediction decision-oriented-training agent-authored-world-modeling
Summary
This paper introduces Agent-Authored World Modeling (AAWM), a training procedure that constructs world-model supervision based on the policy's own decision needs rather than next-observation prediction, aligning the learning objective with the dynamics required for effective decision-making.
arXiv:2606.25421v1 Announce Type: new
# Agent-Authored World Modeling for Sequential Decision Making
Source: [https://arxiv.org/html/2606.25421](https://arxiv.org/html/2606.25421)
## Beyond Next-Observation Prediction: Agent-Authored World Modeling for Sequential Decision Making

Guangfeng Cai¹, Kaibing Yang¹, Shuo He², Yu Li¹, Shengtian Yang¹, Jiaqi Lv¹, Lei Feng¹
¹Southeast University, ²Meituan
{cgfeng,fenglei}@seu.edu.cn

###### Abstract

Recent studies on world modeling for Large Language Model (LLM) agents typically formulate the learning objective as next-observation prediction. However, this objective ties supervision to what a transition happens to reveal, which may omit the dynamics most relevant to the agent’s current decision. To bridge this gap, we propose Agent-Authored World Modeling (AAWM), a training procedure that constructs supervision from the policy’s own decision needs. Specifically, at each state, the agent identifies what it needs to understand about the environment before acting. These needs drive the retrieval of relevant transition evidence across trajectories, which is then synthesized into training targets that capture decision-oriented dynamics instead of reconstructing the next observation. This aligns the training objective with the dynamics the policy needs before acting, not with the contents of the next observation. Experimental results validate the effectiveness of AAWM across multiple environments and training settings. These results show that decision-aware world-model targets provide a more effective learning signal than next-observation prediction.

Corresponding author: Lei Feng.

## 1 Introduction

LLM agents (Achiam et al., [2023](https://arxiv.org/html/2606.25421#bib.bib31); Team et al., [2025](https://arxiv.org/html/2606.25421#bib.bib35); Glm et al., [2024](https://arxiv.org/html/2606.25421#bib.bib34); Liu et al., [2024](https://arxiv.org/html/2606.25421#bib.bib33)) increasingly operate in partially observed environments that require multi-turn interactions to gather information and complete tasks. In such settings, success depends not only on choosing fluent actions but also on representing environment dynamics: which actions are valid, which state variables persist, and which observations matter for the next decision. Recent work therefore fine-tunes large language models to predict the environment’s next response, treating this objective as world modeling (Zhang et al., [2025](https://arxiv.org/html/2606.25421#bib.bib1); Li et al., [2025](https://arxiv.org/html/2606.25421#bib.bib2)). The underlying assumption is that better prediction of the next environment output will produce better action selection.

![Refer to caption](https://arxiv.org/html/2606.25421v1/x1.png)
Figure 1: Comparison between next-observation prediction and AAWM. Next-observation prediction trains on the observation returned by each action, so the target is determined by what the environment happens to reveal. AAWM asks the policy what it needs to know before acting and writes targets that describe the dynamics needed for decision making.

This assumption overlooks a key difference between prediction and decision: the policy often needs the state information that affects its next action rather than a full reconstruction of the next observation (Nair et al., [2020](https://arxiv.org/html/2606.25421#bib.bib26)). Moreover, Li et al. ([2024](https://arxiv.org/html/2606.25421#bib.bib5)) showed that optimizing a model for goal-directed behavior can reduce its predictive accuracy, suggesting that prediction quality and decision quality are not the same objective. This raises a core target selection problem: what environment dynamics does the policy need in the world modeling target before acting?

Prior work has mostly addressed this problem by changing the form of the target. Some methods compress observations into semantic future representations (Berg et al., [2025](https://arxiv.org/html/2606.25421#bib.bib4)), summarize transition-level state changes (Chae et al., [2025](https://arxiv.org/html/2606.25421#bib.bib8)), or add task information for downstream planning (Qiao et al., [2024](https://arxiv.org/html/2606.25421#bib.bib15)). These targets are often more compact or more planning-friendly than raw next observations, but their content is still determined by the transition exposed in the dataset.

As a result, the training target remains environment-driven rather than decision-driven: it reflects what the environment happened to reveal, instead of what the policy needs to understand before acting.

To address this problem, we propose AAWM (Agent-Authored World Modeling), a training procedure that constructs training targets from the policy’s current decision needs. [Figure 1](https://arxiv.org/html/2606.25421#S1.F1) illustrates the difference from next-observation prediction. At each state, the policy first articulates its beliefs about environment dynamics and open questions whose answers could change the next action. These statements serve as retrieval queries over a pool of transition records, gathering evidence about the queried dynamics from other trajectories. The retrieved evidence, together with the current transition and the policy’s statements, is then synthesized into a natural-language target describing decision-oriented dynamics. Fine-tuning on these targets encourages the policy to represent the dynamics that matter for action selection instead of only reconstructing the next observation.

We evaluate AAWM on ALFWorld and WebShop at two model scales. With the same imitation learning and reinforcement learning setup, AAWM consistently outperforms next-observation world modeling, with gains of up to 6.3 and 6.2 success-rate points on the two environments. A separate AgentGym evaluation (Xi et al., [2025](https://arxiv.org/html/2606.25421#bib.bib55)) across four environments shows that AAWM is the only world modeling initialization that improves over imitation learning in every setting. Component ablations show that Self-Probing, Transition Retrieval, and world modeling fine-tuning each contribute to the final gain. Training analysis further shows that AAWM-initialized policies sustain broader decision-oriented reasoning during reinforcement learning while task success continues to rise.

Our contributions are summarized as follows:

- We identify target selection as a central problem in world modeling for language agents: training targets should capture the dynamics that affect action choice, not only reconstruct environment responses.
- We instantiate this principle as AAWM, a world modeling procedure in which the policy’s own beliefs and uncertainties determine what the training target addresses.
- We demonstrate that AAWM outperforms next-observation world modeling across two environments, two model scales, and both supervised and reinforcement learning settings.

![Refer to caption](https://arxiv.org/html/2606.25421v1/x2.png)
Figure 2: AAWM target construction pipeline. At a decision context $o_t$, the policy produces confirmed patterns $P_t$ and open questions $Q_t$ through Self-Probing. Each proposition retrieves evidence from the global transition pool $\mathcal{T}$, and the immediate transitions $\mathcal{I}_t$ provide local evidence from the same context. Dynamics Synthesis combines these inputs into world modeling targets that correct mistaken beliefs and answer open questions when evidence permits.

## 2 Related Work

#### LLMs as decision-making agents.

LLM agents have been widely studied for sequential decision making in software engineering (Yang et al., [2024](https://arxiv.org/html/2606.25421#bib.bib36)), embodied interaction (Wang et al., [2023](https://arxiv.org/html/2606.25421#bib.bib41)), web navigation (Gur et al., [2024](https://arxiv.org/html/2606.25421#bib.bib39); Yang et al., [2024](https://arxiv.org/html/2606.25421#bib.bib36)), and GUI operation (Zhang and Zhang, [2024](https://arxiv.org/html/2606.25421#bib.bib37); Hong et al., [2024](https://arxiv.org/html/2606.25421#bib.bib38)). Prompting methods such as chain-of-thought reasoning and ReAct (Yao et al., [2022b](https://arxiv.org/html/2606.25421#bib.bib40)) structure intermediate reasoning traces to decompose complex tasks, while self-reflection mechanisms such as Reflexion (Shinn et al., [2023](https://arxiv.org/html/2606.25421#bib.bib42)) incorporate verbal feedback from prior failures to revise action selection. Memory (Park et al., [2023](https://arxiv.org/html/2606.25421#bib.bib45)), retrieval (Nakano et al., [2021](https://arxiv.org/html/2606.25421#bib.bib46)), and tool-use (Gou et al., [2024](https://arxiv.org/html/2606.25421#bib.bib43); Schick et al., [2023](https://arxiv.org/html/2606.25421#bib.bib44)) mechanisms further extend agent capabilities by supplying external context or executable operations during interaction. However, these methods improve how the policy selects actions at inference time; they do not change the supervision used to train the policy.

#### World models for LLM agents.

World models have been widely used to provide future information and auxiliary supervision for sequential decision making. In LLM agents, recent work has instantiated this idea through transition-level state change summarization (Chae et al., [2025](https://arxiv.org/html/2606.25421#bib.bib8)), semantic compression (Berg et al., [2025](https://arxiv.org/html/2606.25421#bib.bib4)), result state prediction (Guo et al., [2025](https://arxiv.org/html/2606.25421#bib.bib3)), action simulation (Yu et al., [2025](https://arxiv.org/html/2606.25421#bib.bib10)), and knowledge injection (Qiao et al., [2024](https://arxiv.org/html/2606.25421#bib.bib15)).

A related line of work involves the agent more directly in world-model construction. Chen et al. ([2025](https://arxiv.org/html/2606.25421#bib.bib11)) internalize environment dynamics from the agent’s own rollouts; Wang et al. ([2026](https://arxiv.org/html/2606.25421#bib.bib30)) reinforce explicit prediction of environment states during multi-turn interactions; Hu et al. ([2025](https://arxiv.org/html/2606.25421#bib.bib13)) reconstruct the thinking process with world-model simulation; and Zhang et al. ([2025](https://arxiv.org/html/2606.25421#bib.bib1)) generate additional interaction data from the agent’s own actions and train on the resulting outcomes.

Despite this progress, the content of the training target is still determined by what each transition happens to reveal, which may omit the dynamics most relevant to the agent’s current decision. Our work therefore adapts the training target to the policy’s current beliefs and open questions.

## 3 Preliminaries

#### Problem setup.

We consider a partially observed text environment $\mathcal{E}$ in which a language-model policy $\pi_\theta$ interacts through textual observations and actions. At step $t$, the policy receives the decision context $o_t$ and samples an action $a_t \sim \pi_\theta(\cdot \mid o_t)$. The environment returns the next observation $o_{t+1}$, producing trajectories $\mathcal{D} = \{\tau_i\}$ with $\tau_i = (o_0, a_0, o_1, \ldots, o_T)$, and we write $\mathcal{U}(\mathcal{D}) = \{(o_t, a_t, o_{t+1})\}$ for the set of all one-step transition records. From these records we maintain two collections. The *global transition pool* $\mathcal{T}$ contains records from all trajectories as a single retrievable collection. The *immediate transition set* $\mathcal{I}_t \subseteq \mathcal{T}$ contains records originating from the same context $o_t$, including the logged transition and additional transitions obtained by executing alternative actions from that context.
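The two collections can be pictured with a minimal Python sketch. The `Transition` class, its `logged` flag, and `build_collections` are illustrative names rather than the authors' code, and the example observations are made up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """One-step record (o_t, a_t, o_{t+1}). `logged` is an illustrative flag
    separating the logged action from alternative actions."""
    context: str
    action: str
    next_obs: str
    logged: bool = True


def build_collections(records):
    """Return the global pool T (a list) and a dict mapping each context o_t
    to its immediate transition set I_t."""
    pool = list(records)
    immediate = defaultdict(list)
    for rec in pool:
        immediate[rec.context].append(rec)
    return pool, dict(immediate)


# Made-up records: two transitions share the same context.
records = [
    Transition("You are in the kitchen.", "open fridge", "The fridge is open."),
    Transition("You are in the kitchen.", "go to sink", "You arrive at the sink.", logged=False),
    Transition("You are at the sink.", "turn on faucet", "Water is running."),
]
pool, immediate = build_collections(records)
kitchen = immediate["You are in the kitchen."]
```

Every record in an immediate set is also in the pool, which mirrors $\mathcal{I}_t \subseteq \mathcal{T}$.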

#### World modeling objective.

A standard text world model predicts the environment response conditioned on the current context and action. For each transition record, a textual outcome target $\bar{o}_{t+1} = g(o_{t+1})$ is constructed from the next observation, where $g$ may be the identity map, a state change extractor, or a summarizer. The next-response prediction objective is to minimize

$$\mathcal{L}_{\mathrm{obs}}(\theta) = -\sum_{(o_t, a_t, o_{t+1}) \in \mathcal{U}(\mathcal{D})} \log p_\theta\!\left(\bar{o}_{t+1} \mid o_t, a_t\right) \tag{1}$$

This objective is useful when the next observation contains the dynamics needed for control, but the dynamics that determine a good action can be sparse, implicit, or distributed across earlier transitions. Before acting, the policy may need to verify whether its current understanding of the environment is correct and to resolve specific gaps in that understanding. These needs reflect the state of the policy’s own world model, not the content of the next observation. AAWM addresses this by constructing world model targets from the policy’s decision-oriented modeling needs.
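As a rough illustration of how the choice of $g$ fixes the supervision in Eq. (1), the sketch below builds (input, target) pairs from transition triples. The prompt layout and the helper names are assumptions, and `first_sentence` is only a crude stand-in for a real summarizer.

```python
def identity(obs):
    """g as the identity map: the target is the raw next observation."""
    return obs


def first_sentence(obs):
    """Crude placeholder for a summarizer g: keep text up to the first period."""
    head, sep, _ = obs.partition(".")
    return head + sep


def make_obs_examples(transitions, g=identity):
    """Turn (o_t, a_t, o_{t+1}) triples into next-observation training pairs.
    The input conditions on (o_t, a_t); the target is g(o_{t+1})."""
    return [
        {"input": f"Observation: {o}\nAction: {a}", "target": g(o_next)}
        for (o, a, o_next) in transitions
    ]


# Made-up transition for illustration.
transitions = [
    ("You are in the kitchen.", "open fridge",
     "The fridge is open. Inside you see an apple."),
]
raw_examples = make_obs_examples(transitions)
short_examples = make_obs_examples(transitions, g=first_sentence)
```

Whichever $g$ is used, the target is a function of $o_{t+1}$ alone, which is the limitation the paragraph above describes.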

To align the world modeling objective with what the policy needs before acting, AAWM constructs targets from the policy’s current decision context, with an overview of the pipeline shown in [Figure 2](https://arxiv.org/html/2606.25421#S1.F2).

Self-Probing elicits the policy’s current beliefs and open questions. Transition Retrieval selects supporting records from $\mathcal{T}$. Dynamics Synthesis combines the probes, the immediate transitions, and the retrieved transitions into a fine-tuning target. The following subsections describe these stages in detail.

### 4.1 Self-Probing

Instead of defining target content from the next observation, AAWM elicits the policy’s modeling needs at context $o_t$. Specifically, the policy is prompted to produce two sets of propositions:

$$(P_t, Q_t) \sim \pi_\theta(\cdot \mid o_t), \tag{2}$$

where $P_t = \{p_t^1, \ldots, p_t^n\}$ are *confirmed patterns* and $Q_t = \{q_t^1, \ldots, q_t^n\}$ are *open questions*.

Confirmed patterns specify parts of the policy’s current understanding that may require verification or correction against retrieved transitions. Open questions identify dynamics that the policy has not yet resolved and whose answers could affect the next action.

Together, $(P_t, Q_t)$ determine what the training target should contain and drive the subsequent retrieval and synthesis steps. The prompt template is in Appendix [D](https://arxiv.org/html/2606.25421#A4).
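To make the two proposition sets concrete, here is a small sketch that splits a policy response into $P_t$ and $Q_t$. The `PATTERN:` / `QUESTION:` line format is purely hypothetical; the actual prompt and output format are given in the paper's appendix and are not reproduced here.

```python
def parse_probes(response_text):
    """Split a Self-Probing response into (confirmed_patterns, open_questions).
    Assumes a hypothetical format with one 'PATTERN: ...' or 'QUESTION: ...'
    proposition per line; other lines are ignored."""
    patterns, questions = [], []
    for line in response_text.splitlines():
        line = line.strip()
        if line.upper().startswith("PATTERN:"):
            patterns.append(line[len("PATTERN:"):].strip())
        elif line.upper().startswith("QUESTION:"):
            questions.append(line[len("QUESTION:"):].strip())
    return patterns, questions


# Made-up response for illustration.
response = """PATTERN: Objects stay in a receptacle until I take them.
QUESTION: Must the fridge be open before I can cool an item?
Some free-form reasoning that is ignored.
question: Does heating require a microwave?"""
patterns, questions = parse_probes(response)
```

Each returned string can then serve directly as a retrieval query in the next stage.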

### 4.2 Transition Retrieval

Since the required dynamics may be absent from the immediate transition set $\mathcal{I}_t$, we retrieve relevant transitions from the global transition pool $\mathcal{T}$.

For each proposition $r \in P_t \cup Q_t$, AAWM retrieves $K$ transitions from $\mathcal{T}$ using maximal marginal relevance (Carbonell and Goldstein, [1998](https://arxiv.org/html/2606.25421#bib.bib29)), which balances relevance to $r$ with diversity across the selected records, so that the evidence covers different aspects of the queried dynamics instead of redundant records. The retrieved set is

$$\mathcal{R}_t = \bigcup_{r \in P_t \cup Q_t} \mathrm{Retrieve}(r, \mathcal{T}; K). \tag{3}$$

Implementation details are in Appendix [E](https://arxiv.org/html/2606.25421#A5).
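A minimal sketch of greedy maximal marginal relevance over embedding vectors, followed by the union in Eq. (3), is given below. It assumes cosine similarity and a trade-off weight `lam`; both are illustrative choices, since the paper defers its retrieval settings to the appendix, and the toy vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mmr_retrieve(query_vec, pool_vecs, k, lam=0.5):
    """Greedily pick up to k pool indices, scoring each candidate as
    lam * relevance - (1 - lam) * max similarity to already selected items."""
    relevance = [cosine(query_vec, v) for v in pool_vecs]
    selected, candidates = [], list(range(len(pool_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(pool_vecs[i], pool_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


def retrieve_for_probes(probe_vecs, pool_vecs, k, lam=0.5):
    """Union of per-proposition retrievals (Eq. 3), as sorted pool indices."""
    union = set()
    for q in probe_vecs:
        union.update(mmr_retrieve(q, pool_vecs, k, lam))
    return sorted(union)


# Toy 2-D embeddings: items 0 and 1 are near-duplicates, item 2 is different.
pool_vecs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
query = [1.0, 0.2]
diverse = mmr_retrieve(query, pool_vecs, k=2, lam=0.5)
relevance_only = mmr_retrieve(query, pool_vecs, k=2, lam=1.0)
```

With `lam=1.0` the selection reduces to plain top-$K$ by relevance and returns the two near-duplicates; lowering `lam` swaps the second pick for the dissimilar record.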

### 4.3 Dynamics Synthesis

Given the policy’s acting needs specified by the probes and the transition evidence grounded by retrieval, the final stage synthesizes an agent-authored world-modeling target.

Specifically, an external instruction-following model $f_{\mathrm{syn}}$ receives the current context $o_t$, confirmed patterns $P_t$, open questions $Q_t$, the immediate transitions $\mathcal{I}_t$ from the same context, and the retrieved transitions $\mathcal{R}_t$ from the global pool:

$$y_t = f_{\mathrm{syn}}\!\left(o_t,\ P_t,\ Q_t,\ \mathcal{I}_t,\ \mathcal{R}_t\right). \tag{4}$$

The model is instructed to correct any unsupported beliefs, resolve open questions where the evidence permits, and summarize the dynamics most relevant to action selection. The resulting $y_t$ thus describes environment dynamics shaped by what the policy needs before acting. The synthesis prompt is in Appendix [D](https://arxiv.org/html/2606.25421#A4), and the synthesis model is described in [Section 5.1](https://arxiv.org/html/2606.25421#S5.SS1).
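The five inputs of Eq. (4) could be assembled into a single text input along the following lines. This is only a sketch: the section titles, the record formatting, and the function name are assumptions, and the paper's actual synthesis prompt lives in its appendix.

```python
def build_synthesis_input(context, patterns, questions, immediate, retrieved):
    """Assemble (o_t, P_t, Q_t, I_t, R_t) into one text block for f_syn.
    `immediate` and `retrieved` are lists of (action, next_observation) pairs.
    The layout is hypothetical, not the authors' template."""
    def bullets(items):
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    def records(pairs):
        return bullets(f"{action} -> {obs}" for action, obs in pairs)

    sections = [
        ("Current context", context),
        ("Confirmed patterns", bullets(patterns)),
        ("Open questions", bullets(questions)),
        ("Immediate transitions", records(immediate)),
        ("Retrieved transitions", records(retrieved)),
    ]
    return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Made-up inputs for illustration.
prompt = build_synthesis_input(
    context="You are in the kitchen holding an apple.",
    patterns=["Items can be cooled in the fridge."],
    questions=["Must the fridge be open first?"],
    immediate=[("cool apple with fridge", "Nothing happens.")],
    retrieved=[],
)
```

Keeping the probes in their own sections lets the synthesis model address each proposition explicitly instead of summarizing the transitions generically.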

### 4.4 World Modeling

Let $\mathcal{D}_{\mathrm{A}} = \{(o_t, y_t)\}$ denote the dataset produced by Dynamics Synthesis. AAWM fine-tunes the policy with the agent-authored world-modeling objective

$$\mathcal{L}_{\mathrm{A}}(\theta) = -\sum_{(o_t, y_t) \in \mathcal{D}_{\mathrm{A}}} \log p_\theta\!\left(y_t \mid o_t\right), \tag{5}$$

where the subscript $\mathrm{A}$ denotes AAWM. The target therefore updates the policy toward the dynamics specified by its own decision needs. The resulting parameters initialize the subsequent imitation learning and reinforcement learning stages.
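For a single $(o_t, y_t)$ pair, Eq. (5) amounts to summing negative log-probabilities over the target tokens only, with the context tokens serving as conditioning. The sketch below assumes per-token log-probabilities are already available from the model; the function name and the numbers are illustrative.

```python
def masked_nll(token_logprobs, target_mask):
    """Negative log-likelihood of the target tokens y_t given the context o_t.
    token_logprobs[i] is log p(token_i | preceding tokens); target_mask[i] is
    True for tokens of y_t and False for tokens of the context o_t."""
    if len(token_logprobs) != len(target_mask):
        raise ValueError("log-probabilities and mask must have equal length")
    return -sum(lp for lp, is_target in zip(token_logprobs, target_mask) if is_target)


# Made-up values: two context tokens followed by two target tokens.
logprobs = [-0.1, -0.2, -1.0, -0.5]
mask = [False, False, True, True]
loss = masked_nll(logprobs, mask)
```

The same routine would cover Eq. (1) by treating $(o_t, a_t)$ as the context and $\bar{o}_{t+1}$ as the target, so the two objectives differ in what the target says rather than in how the loss is computed.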

## 5 Experiments

### 5.1 Setup

#### Environments and evaluation.

We evaluate on ALFWorld (Shridhar et al., [2020](https://arxiv.org/html/2606.25421#bib.bib48)) and WebShop (Yao et al., [2022a](https://arxiv.org/html/2606.25421#bib.bib49)). ALFWorld is a textual embodied environment with six household task types: Pick, Clean, Cool, Look, Heat, and Pick2. WebShop is an HTML shopping environment where the agent searches, inspects, and purchases products according to user instructions. For ALFWorld we report the success rate for each task type and the aggregate (All); for WebShop we report the success rate (Succ.) and the reward score (Score). We follow the environment configuration and evaluation protocol from the open-source repository (Feng et al., [2026](https://arxiv.org/html/2606.25421#bib.bib20)).

#### Baseline.

We use Implicit World Modeling (IWM) as the world modeling baseline, a representative next-observation prediction method for language agents (Zhang et al., [2025](https://arxiv.org/html/2606.25421#bib.bib1)). Given the current observation and action $(o_t, a_t)$, IWM trains the model to predict the next observation. Following the original IWM setting, the target is the raw textual observation on ALFWorld and an offline summary of the next observation on WebShop.

#### World modeling data.

We draw world modeling data from AgentTraj-L (Xi et al., [2025](https://arxiv.org/html/2606.25421#bib.bib55)) and keep all successful trajectories executable in the open-source repository (Feng et al., [2026](https://arxiv.org/html/2606.25421#bib.bib20)) as $\mathcal{D}$. At every action step, we execute the logged action and three additional actions from the same state, producing four transitions that form $\mathcal{I}_t$ and populate $\mathcal{T}$. Both IWM and AAWM share the same $\mathcal{D}$ and $\mathcal{T}$ within each backbone scale. Trajectory counts and action-sampling details are provided in Appendix [A.1](https://arxiv.org/html/2606.25421#A1.SS1).

#### Models and target construction.

We train the policy at two scales, Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2606.25421#bib.bib50)). Self-Probing uses the policy backbone at the corresponding scale, and Transition Retrieval embeds records with Qwen3-Embedding-0.6B (Yang et al., [2025](https://arxiv.org/html/2606.25421#bib.bib51)). We use Qwen3-30B-A3B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2606.25421#bib.bib51)) only for target construction, where it produces WebShop observation summaries for IWM and synthesized targets for AAWM. This keeps the target construction model shared between the two world modeling methods.

#### Training protocol.

Both IWM and AAWM first initialize the policy by supervised fine-tuning on their respective world modeling targets, followed by a lightweight imitation learning stage on 32 trajectories from $\mathcal{D}$ for one epoch. All RL rows then run Group Relative Policy Optimization (GRPO; Shao et al., [2024](https://arxiv.org/html/2606.25421#bib.bib52)).
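For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same task, by normalizing rewards within the group. The sketch below shows only that normalization; the use of the population standard deviation and the `eps` constant are assumptions, and the clipped policy-gradient update and KL term are omitted.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up binary success rewards for four rollouts of one task.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
no_signal = group_relative_advantages([1.0, 1.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal, which is one reason the quality of the initialization matters before reinforcement learning starts.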

[Table 1](https://arxiv.org/html/2606.25421#S5.T1) reports prompting baselines and trained-policy results at each backbone scale. Within each backbone, all trained rows share the same trajectory source, transition pool, imitation data, and GRPO budget.

Base denotes downstream training without world modeling initialization. We report mean and standard deviation over three seeds. More details are in Appendix [A.1](https://arxiv.org/html/2606.25421#A1.SS1).

### 5.2 Main Results

Table 1: Main results on ALFWorld and WebShop. In trained panels, Base denotes no world modeling initialization, IWM denotes next-observation world modeling, and AAWM denotes Agent-Authored World Modeling. Trained rows report mean ± std over three seeds; bold marks the best within each backbone and condition.

[Table 1](https://arxiv.org/html/2606.25421#S5.T1) shows that AAWM consistently improves over the world-modeling baseline on both agentic benchmarks and at both backbone scales. Although closed-source prompting models remain competitive, with Gemini-2.5-Pro (Team et al., [2023](https://arxiv.org/html/2606.25421#bib.bib53)) and GPT-4o (Achiam et al., [2023](https://arxiv.org/html/2606.25421#bib.bib31)) reaching 60.3% and 48.0% success on ALFWorld and 35.9% and 23.7% on WebShop, post-training narrows and then reverses this gap for the Qwen-family models. Among the trained methods, AAWM achieves the strongest overall performance.

#### AAWM provides a stronger initialization than next-observation prediction.

At the 1.5B scale, Base reaches 74.0% success on ALFWorld, while IWM improves it to 76.8% and AAWM further improves it to 83.1%. WebShop shows the same ordering, with success increasing from 59.4% (Base) to 67.2% (IWM) and 73.4% (AAWM). At the 7B scale, AAWM again achieves the strongest aggregate performance, reaching 90.1% on ALFWorld and 76.6% on WebShop.
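The margins implied by the 1.5B success rates quoted above can be re-derived with a few lines of arithmetic. The dictionary layout is only for illustration; the numbers are the ones stated in the text.

```python
# Aggregate success rates (%) at the 1.5B scale, as quoted in the text.
success = {
    "ALFWorld": {"Base": 74.0, "IWM": 76.8, "AAWM": 83.1},
    "WebShop": {"Base": 59.4, "IWM": 67.2, "AAWM": 73.4},
}

# Gain of AAWM over each reference method, in success-rate points.
gain_over_iwm = {env: round(r["AAWM"] - r["IWM"], 1) for env, r in success.items()}
gain_over_base = {env: round(r["AAWM"] - r["Base"], 1) for env, r in success.items()}

for env in success:
    print(f"{env}: +{gain_over_iwm[env]} over IWM, +{gain_over_base[env]} over Base")
```

The gains over IWM come out to 6.3 points on ALFWorld and 6.2 on WebShop, matching the margins reported in the introduction, and the gains over Base are 9.1 and 14.0 points.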

#### AAWM consistently outperforms on the harder tasks.

The ALFWorld breakdown shows that the aggregate gain is concentrated on the more difficult categories. Pick and Clean are easier for most methods, and they are the only categories where IWM slightly exceeds AAWM: Pick at 1.5B and Clean at 7B. On the harder categories, the ordering reverses and the margins become larger. At 1.5B, AAWM improves over IWM by 11.0 points on Cool, 10.6 on Pick2, and 7.5 on Look. At 7B, the gains are 9.0 points on Look, 7.4 on Cool, and 6.0 on Pick2. These tasks require reasoning over multi-step preconditions and persistent object states, which are difficult to recover from a single observation. This is where Self-Probing and cross-trajectory retrieval are expected to be most useful, because they expose decision-relevant dynamics beyond the immediate observation.

#### AAWM produces a more decision-oriented initialization.

A further question is whether this advantage comes from the world modeling targets themselves or from their interaction with the imitation learning stage. To isolate this, we remove the imitation learning stage so that world modeling alone carries the full initialization budget before GRPO (Appendix [B](https://arxiv.org/html/2606.25421#A2)). IWM then drops sharply on ALFWorld, while AAWM stays close to the result of running GRPO directly from the backbone and still improves WebShop. Since both methods share the same Qwen3-30B-A3B-Instruct-2507 construction model on WebShop and differ only in target content, this gap reflects the nature of the targets rather than model capacity. AAWM depends far less on action imitation because its targets already describe the dynamics the policy needs when selecting actions.

### 5.3 Ablation Studies

Table 2: Ablation at Qwen2.5-1.5B-Instruct. All rows use the same lightweight imitation learning stage and are evaluated without GRPO. Base denotes imitation learning without world modeling initialization. ALFWorld reports aggregate success rate, and WebShop reports success rate. Cells report mean ± std over three seeds.

AAWM rests on two design assumptions: Self-Probing directs the target toward decision-oriented dynamics, and Transition Retrieval supplies evidence beyond the current state. [Table 2](https://arxiv.org/html/2606.25421#S5.T2) tests both by removing each component at 1.5B, with all rows sharing the same imitation learning stage. Full AAWM improves ALFWorld from 12.0% to 24.7% and WebShop from 4.2% to 10.7% over Base.

#### Retrieved evidence supplies dynamics beyond the current state.

Removing Transition Retrieval lowers the results to 19.3% on ALFWorld and 7.5% on WebShop. Without retrieved evidence, Dynamics Synthesis sees only the immediate transitions at the current state, which often do not contain enough information to resolve the policy’s open questions.

#### Self-Probing tells synthesis which dynamics to address.

Removing Self-Probing lowers the results further, to 18.0% on ALFWorld and 7.0% on WebShop. Without the policy’s confirmed patterns and open questions, the synthesis model has no explicit signal about what the target should cover. The full method outperforms both single-component removals, indicating that each component contributes to shifting the training target from environment-driven content to decision-driven content.

### 5.4 Evaluation on More Environments

The main experiments use two environments with GRPO. To test whether decision-oriented targets remain beneficial across a broader set of environments under supervised training alone, we evaluate on four environments from AgentGym (Xi et al., [2025](https://arxiv.org/html/2606.25421#bib.bib55)). All methods train Qwen2.5-1.5B-Instruct on the same mixture of AgentTraj-L trajectories from TextCraft, SciWorld, WebShop, and ALFWorld, and use the same imitation learning stage. We also include IWM-Summary, a variant that uses the same Qwen3-30B-A3B model as AAWM to summarize observations before prediction, which controls for the capacity of the construction model. Data counts and evaluation settings are in Appendix [A.2](https://arxiv.org/html/2606.25421#A1.SS2).

Table 3: Results on four AgentGym environments at Qwen2.5-1.5B-Instruct. All trained rows share the same AgentTraj-L trajectory mixture and imitation learning stage. All is weighted by the number of evaluation tasks per environment. Cells report success rate (%), mean ± std over three seeds.

[Table 3](https://arxiv.org/html/2606.25421#S5.T3) reports success rate in each environment and the weighted aggregate All. Base reaches an All of 61.1%. The two next-observation variants do not give a consistent improvement. IWM falls to 59.6%, because predicting raw observations helps SciWorld but hurts WebShop and ALFWorld. IWM-Summary recovers WebShop and improves SciWorld for an All of 61.4%, yet it drops sharply on ALFWorld to 20.0%. AAWM instead reaches the best result in all four environments, with an All of 69.3% and the largest single gain of 21.0 points on SciWorld. It is the only world modeling initialization that improves over Base in every environment. Because all methods share the backbone, the trajectory mixture, and the imitation learning stage, this consistency suggests that targets shaped by the policy’s decision needs provide a more reliable learning signal than targets derived from the next observation across the environments tested.
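The weighted aggregate All described in the Table 3 caption can be computed as below. The function name is illustrative, and the rates and task counts are placeholders chosen for easy arithmetic, not values from the paper (the actual counts are in its appendix).

```python
def weighted_all(success_rates, task_counts):
    """Aggregate success rate weighted by the number of evaluation tasks
    per environment: sum(rate_e * n_e) / sum(n_e)."""
    total = sum(task_counts[env] for env in success_rates)
    if total == 0:
        raise ValueError("no evaluation tasks")
    return sum(success_rates[env] * task_counts[env] for env in success_rates) / total


# Placeholder numbers only, NOT results from the paper.
placeholder_rates = {"env_a": 50.0, "env_b": 100.0}
placeholder_counts = {"env_a": 100, "env_b": 300}
aggregate = weighted_all(placeholder_rates, placeholder_counts)
unweighted = sum(placeholder_rates.values()) / len(placeholder_rates)
```

Because environments with more evaluation tasks carry more weight, All can differ noticeably from the simple mean of the per-environment rates (87.5 versus 75.0 in this placeholder example).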

## 6 Analysis

The experiments show that AAWM consistently outperforms next-observation prediction. We now investigate two mechanism questions: whether the AAWM initialization leads to more effective exploration during GRPO rather than faster output collapse, and whether Self-Probing produces decision-relevant propositions that Dynamics Synthesis resolves with grounded evidence.

#### Training dynamics.

A policy that converges faster could be collapsing to a narrow output distribution or exploring more effectively. We distinguish these two explanations by tracking success rate and response entropy during GRPO, alongside response length and rollout content. [Figure 3](https://arxiv.org/html/2606.25421#S6.F3) shows the WebShop dynamics, and Appendix [C](https://arxiv.org/html/2606.25421#A3) reports the matching ALFWorld curves.

The success curves show that AAWM learns faster. On WebShop, all three methods stay low during the first ten steps, after which AAWM rises much more rapidly than Base and IWM and keeps the highest success rate for most of training; ALFWorld follows the same trend. We then ask whether this faster convergence comes from premature collapse or from more effective exploration. The entropy curves support exploration, since AAWM maintains the highest response entropy for most of training even while its success rate rises fastest. Its outputs are also longer on average, with 167.6 tokens on WebShop and 98.7 on ALFWorld against 121.4 and 57.8 for IWM, and the extra tokens appear mainly in the reasoning trace.

Representative rollouts in Appendix [G](https://arxiv.org/html/2606.25421#A7) show the AAWM-initialized policy reasoning about environment mechanisms before acting, including object persistence, action preconditions, and product attributes. These observations indicate that AAWM induces richer and more useful exploration during GRPO, organized around decision-oriented dynamics. By contrast, IWM improves over Base but keeps lower entropy and a smaller gain, which suggests that next-observation prediction is less effective at encouraging active reasoning about environment mechanisms.
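The text does not spell out how response entropy is computed. One common reading, assumed here, is the Shannon entropy of the next-token distribution averaged over the tokens of a response; the sketch below follows that reading with made-up distributions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def response_entropy(step_distributions):
    """Mean per-token entropy over a response; 0.0 for an empty response.
    Assumes each entry is a full probability distribution over the vocabulary."""
    if not step_distributions:
        return 0.0
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


# Made-up two-token response over a four-token vocabulary:
# one maximally uncertain step and one fully deterministic step.
uniform_step = [0.25, 0.25, 0.25, 0.25]
peaked_step = [1.0, 0.0, 0.0, 0.0]
value = response_entropy([uniform_step, peaked_step])
```

Under this reading, a policy that has collapsed to near-deterministic outputs would show entropy close to zero, which is the failure mode the analysis above is designed to rule out.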

![Refer to caption](https://arxiv.org/html/2606.25421v1/x3.png)

Figure 3: GRPO training dynamics on WebShop under the imitation learning then reinforcement learning setting initialized by world modeling. The backbone is Qwen2.5-1.5B-Instruct. The top panel shows success rate, and the bottom panel shows response entropy. Bold curves show time-weighted EMA smoothing with coefficient 0.95. AAWM converges faster while maintaining higher response entropy during training.

#### Alignment of Self-Probing and Synthesis.

AAWM rests on two assumptions. Self-Probing should produce propositions that target decision-oriented dynamics rather than surface observations, and Dynamics Synthesis should answer those propositions with information grounded in transition evidence. We test both with an LLM-as-judge evaluation over all $(o_t, q, y_t)$ triples from data construction, where $o_t$ is the observation, $q$ is a single proposition, and $y_t$ is the synthesis output. The judge scores each triple on three dimensions ([Table 4](https://arxiv.org/html/2606.25421#S6.T4)). Decision helpfulness $\bar{d}_1$ measures whether the proposition targets a mechanism that could affect action choice. Confirmed pattern correction $\bar{d}_2^{P}$ measures whether the synthesis finds and corrects a factual error in a confirmed pattern. Open question resolution $\bar{d}_2^{Q}$ measures whether the synthesis gives a definite answer to an open question. The judge prompt is in Appendix [D](https://arxiv.org/html/2606.25421#A4).

Table 4: LLM-as-judge evaluation of Self-Probing and Dynamics Synthesis. Each cell reports the fraction of $(o_t, q, y_t)$ triples satisfying the criterion. The judge is Qwen3-30B-A3B-Instruct-2507.

Nearly every proposition targets decision-oriented dynamics, with $\bar{d}_1 \geq 0.96$, so Self-Probing usually asks about mechanisms that can affect the next action rather than surface observations. Confirmed pattern correction is less frequent, at 0.19 on ALFWorld and 0.25 on WebShop, which is expected because the score is positive only when a confirmed pattern contains an error that the synthesis then corrects. When the policy raises an open question, the synthesis resolves it in almost all cases, with $\bar{d}_2^{Q} \geq 0.96$. Together these results confirm that the AAWM training signal is decision-oriented already at the data construction stage, which directly explains the stronger exploration observed above. Representative triples are in Appendix [F](https://arxiv.org/html/2606.25421#A6).

#### Sensitivity to the number of propositions.

We finally vary the number $n$ of confirmed patterns and open questions produced at each state (Appendix [A.3](https://arxiv.org/html/2606.25421#A1.SS3)). Increasing $n$ from 1 to 3 clearly improves both environments, which confirms that a single proposition does not cover the dynamics needed for action selection, while moving from 3 to 5 adds little. We therefore use $n=3$ in all main experiments as a compact setting that captures most of the benefit.

## 7 Conclusion

World model training for language agents has typically predicted the next environment output, but prediction accuracy and decision quality serve different goals. AAWM addresses this gap by letting the policy state what it understands and what remains uncertain, then constructing training targets from these statements. Under matched training budgets, AAWM consistently outperforms next-observation world modeling across two environments, two model scales, and both supervised and reinforcement learning settings, with a separate AgentGym evaluation confirming the same result across four additional environment mixtures. A natural extension is to repeat Self-Probing after each training round so that targets evolve with the policy's understanding. More broadly, these results suggest that world model targets should be evaluated by the decisions they help the policy make.

## Limitations

AAWM runs Self-Probing once during data construction, so the training target reflects the policy's beliefs at that point rather than its beliefs as training proceeds. Repeating Self-Probing after each training round could turn the target into a curriculum that follows the policy's changing understanding, at the cost of additional construction compute, and we leave this to future work.

Our experiments also cover text-only environments. Extending AAWM to multimodal settings, where a vision-language policy must model dynamics from both images and text, is a natural next step that we have not yet explored.

## Ethical Considerations

This work studies world-model training for language agents in sandboxed benchmark environments. Because the proposed method improves sequential decision-making ability, similar techniques could potentially be misused in open-ended web, GUI, or tool-use settings where autonomous actions may have external consequences. In addition, synthesized world-model targets may reflect incomplete or biased transition evidence, which could lead agents to learn overconfident or incorrect environment dynamics. Our experiments do not deploy agents in real-world systems, and practical deployment would require additional safeguards, monitoring, and validation of generated targets.

## References

- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p1.1), [§5.2](https://arxiv.org/html/2606.25421#S5.SS2.p1.1).
- J. Berg, C. Zhu, Y. Bao, I. Durugkar, and A. Gupta (2025) Semantic world models. arXiv preprint arXiv:2510.19818. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p4.1), [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p1.1).
- J. Carbonell and J. Goldstein (1998) The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 335–336. Cited by: [§4.2](https://arxiv.org/html/2606.25421#S4.SS2.p2.4).
- H. Chae, N. Kim, K. Ong, M. Gwak, G. Song, J. Kim, S. Kim, D. Lee, and J. Yeo (2025) Web agents with world models: learning and leveraging environment dynamics in web navigation. In International Conference on Learning Representations, Vol. 2025, pp. 63707–63738. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p4.1), [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p1.1).
- S. Chen, T. Zhu, Z. Wang, J. Zhang, K. Wang, S. Gao, T. Xiao, Y. W. Teh, J. He, and M. Li (2025) Internalizing world models via self-play finetuning for agentic rl. arXiv preprint arXiv:2510.15047. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p2.1).
- L. Feng, Z. Xue, T. Liu, and B. An (2026) Group-in-group policy optimization for llm agent training. Advances in Neural Information Processing Systems 38, pp. 46375–46408. Cited by: [§A.1](https://arxiv.org/html/2606.25421#A1.SS1.p4.2), [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px1.p1.1), [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px3.p1.5).
- T. Glm, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, D. Zhang, D. Rojas, G. Feng, H. Zhao, et al. (2024) Chatglm: a family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p1.1).
- Z. Gou, Z. Shao, Y. Gong, Y. Yang, M. Huang, N. Duan, W. Chen, et al. (2024) Tora: a tool-integrated reasoning agent for mathematical problem solving. In International Conference on Learning Representations, Vol. 2024, pp. 48362–48395. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- S. Guo, O. D. Domingues, R. Avalos, A. Courville, and F. Strub (2025) World modelling improves language model agents. arXiv preprint arXiv:2506.02918. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p1.1).
- I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, and A. Faust (2024) A real-world webagent with planning, long context understanding, and program synthesis. In International Conference on Learning Representations, Vol. 2024, pp. 52690–52717. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding, et al. (2024) Cogagent: a visual language model for gui agents. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14281–14290. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- M. Hu, B. Xia, Y. Wu, A. Yu, Y. Zou, Q. Chen, S. Wang, J. Jin, K. Li, W. Jiao, et al. (2025) Agent2World: learning to generate symbolic world models via adaptive multi-agent feedback. arXiv preprint arXiv:2512.22336. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p2.1).
- M. Li, W. Shi, A. Pagnoni, P. West, and A. Holtzman (2024) Predicting vs. acting: a trade-off between world modeling & agent modeling. arXiv preprint arXiv:2407.02446. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p3.1).
- Y. Li, H. Wang, J. Qiu, Z. Yin, D. Zhang, C. Qian, Z. Li, P. Ma, G. Chen, and H. Ji (2025) From word to world: can large language models be implicit text-based world models? arXiv preprint arXiv:2512.18832. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p1.1).
- A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p1.1).
- S. Nair, S. Savarese, and C. Finn (2020) Goal-aware prediction: learning to model what matters. In International Conference on Machine Learning, pp. 7207–7219. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p3.1).
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. (2021) Webgpt: browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- J. S. Park, J. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp. 1–22. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- S. Qiao, R. Fang, N. Zhang, Y. Zhu, X. Chen, S. Deng, Y. Jiang, P. Xie, F. Huang, and H. Chen (2024) Agent planning with world knowledge model. Vol. 37, pp. 114843–114871. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p4.1), [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p1.1).
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. [Link](https://arxiv.org/abs/2412.15115). Cited by: [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px4.p1.1).
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. Advances in neural information processing systems 36, pp. 68539–68551. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px5.p1.1).
- N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao (2023) Reflexion: language agents with verbal reinforcement learning. Advances in neural information processing systems 36, pp. 8634–8652. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. Hausknecht (2020) Alfworld: aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768. Cited by: [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px1.p1.1).
- G. Team, R. Anil, S. Borgeaud, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. (2023) Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805. Cited by: [§5.2](https://arxiv.org/html/2606.25421#S5.SS2.p1.1).
- K. Team, Y. Bai, Y. Bao, Y. Charles, C. Chen, G. Chen, H. Chen, H. Chen, J. Chen, N. Chen, et al. (2025) Kimi k2: open agentic intelligence. arXiv preprint arXiv:2507.20534. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p1.1).
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar (2023) Voyager: an open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- K. Wang, P. Zhang, Z. Wang, Y. Gao, L. Li, Q. Wang, H. Chen, Y. Lu, Z. Yang, L. Wang, et al. (2026) Vagen: reinforcing world model reasoning for multi-turn vlm agents. Advances in Neural Information Processing Systems 38, pp. 172871–172933. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p2.1).
- Z. Xi, Y. Ding, W. Chen, B. Hong, H. Guo, J. Wang, X. Guo, D. Yang, C. Liao, W. He, et al. (2025) Agentgym: evaluating and training large language model-based agents across diverse environments. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 27914–27961. Cited by: [§A.1](https://arxiv.org/html/2606.25421#A1.SS1.p1.1), [§1](https://arxiv.org/html/2606.25421#S1.p7.1), [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px3.p1.5), [§5.4](https://arxiv.org/html/2606.25421#S5.SS4.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px4.p1.1).
- J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024) Swe-agent: agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37, pp. 50528–50652. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022a) Webshop: towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, pp. 20744–20757. Cited by: [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px1.p1.1).
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2022b) React: synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).
- X. Yu, B. Peng, R. Xu, M. Galley, H. Cheng, S. Nath, J. Gao, and Z. Yu (2025) Dyna-think: synergizing reasoning, acting, and world model simulation in ai agents. arXiv preprint arXiv:2506.00320. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p1.1).
- K. Zhang, X. Chen, B. Liu, T. Xue, Z. Liao, Z. Liu, X. Wang, Y. Ning, Z. Chen, X. Fu, et al. (2025) Agent learning via early experience. arXiv preprint arXiv:2510.08558. Cited by: [§1](https://arxiv.org/html/2606.25421#S1.p1.1), [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px2.p2.1), [§5.1](https://arxiv.org/html/2606.25421#S5.SS1.SSS0.Px2.p1.1).
- Z. Zhang and A. Zhang (2024) You only look at screens: multimodal chain-of-action agents. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3132–3149. Cited by: [§2](https://arxiv.org/html/2606.25421#S2.SS0.SSS0.Px1.p1.1).

## Appendix A Experimental Details

### A.1 Main Experiment Details

**Trajectory selection.** AgentTraj-L is a collection of high-quality agent trajectories released with AgentGym (Xi et al., [2025](https://arxiv.org/html/2606.25421#bib.bib55)). For the main ALFWorld and WebShop experiments, we select complete trajectories whose action sequences can be executed from start to finish in the GiGPO open-source evaluation code and reach a successful outcome. The selected set $\mathcal{D}$ contains 1,655 trajectories with 19,039 action steps for ALFWorld, and 1,027 trajectories with 4,971 action steps for WebShop.

**Transition collection.** We build the transition bank from every action step in $\mathcal{D}$. At each step, we execute the action in the trajectory and three additional actions from the same state, so each step contributes exactly four one-step transitions. For ALFWorld, the three additional actions are sampled uniformly without replacement from the admissible action set after removing the trajectory action and `help`. For WebShop, two additional actions are sampled from the matched policy backbone with temperature 0.8, and one additional action is sampled uniformly from the admissible action set after removing the trajectory action. Sampled actions are canonicalized and deduplicated. If a policy-sampled action cannot be executed or duplicates an existing action, we replace it with an unused admissible action. This procedure produces exactly three additional actions for every step. The final transition bank contains 76,156 one-step transitions for ALFWorld and 19,884 one-step transitions for WebShop.
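The ALFWorld branch-action rule and the reported bank sizes can be checked with a short sketch. The helper names, the example action strings, and the choice to raise an error when fewer than three candidates remain are assumptions made for illustration; the paper does not describe that edge case.

```python
import random

def sample_branch_actions(admissible, trajectory_action, k=3, rng=None):
    """Sample k extra actions uniformly without replacement (ALFWorld rule)."""
    rng = rng or random.Random()
    excluded = {trajectory_action, "help"}
    # dict.fromkeys removes duplicates while preserving order.
    candidates = [a for a in dict.fromkeys(admissible) if a not in excluded]
    if len(candidates) < k:
        # Assumption: the source does not say how this case is handled.
        raise ValueError("fewer than k admissible branch actions")
    return rng.sample(candidates, k)

def transition_bank_size(num_action_steps, transitions_per_step=4):
    """One trajectory action plus three branch actions per step."""
    return num_action_steps * transitions_per_step

# Illustrative admissible set for a single ALFWorld state.
admissible = ["go to sinkbasin 1", "go to fridge 1", "look",
              "inventory", "help", "examine countertop 1"]
branch = sample_branch_actions(admissible, "go to sinkbasin 1",
                               rng=random.Random(0))
print(branch)
print(transition_bank_size(19039), transition_bank_size(4971))
```

The two products, 19,039 × 4 and 4,971 × 4, reproduce the reported 76,156 and 19,884 transitions.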

**Supervised fine-tuning settings.** World model fine-tuning and imitation learning both use full-parameter supervised fine-tuning. The learning rate is $1\times 10^{-5}$. The effective batch size is 16 for ALFWorld and 8 for WebShop. The world modeling stage uses two epochs for both IWM and AAWM. The imitation learning stage uses one epoch. The imitation learning subset contains 32 trajectories from $\mathcal{D}$, with 429 action steps for ALFWorld and 158 action steps for WebShop. For each action step, the model is supervised to generate the action content in the format `<action>...</action>`.

**GRPO settings.** We use GRPO for all reinforcement learning experiments and follow the public implementation of Feng et al. ([2026](https://arxiv.org/html/2606.25421#bib.bib20)). For ALFWorld, the maximum prompt length is 2048 tokens, the maximum response length is 512 tokens, and each episode allows up to 50 environment steps. For WebShop, the maximum prompt length is 4096 tokens, the maximum response length is 512 tokens, and each episode allows up to 15 environment steps. Both environments use a rule-based reward of 10 for success and 0 for failure, with a penalty of $-0.1$ for invalid actions. The actor learning rate is $1\times 10^{-6}$. For group-based RL, the group size is 8 and each rollout samples 16 groups, giving 128 environments in total. The rollout temperature is 1.0, the validation temperature is 0.4, and the KL coefficient is 0.01. The mini-batch size is 256 for ALFWorld and 64 for WebShop.
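As a rough illustration of how these settings interact, the sketch below computes an episode return from the stated reward values and the standard GRPO group-relative advantage (Shao et al., 2024). Two points are assumptions rather than details from the paper: that the invalid-action penalty accumulates once per invalid action, and that the group standard deviation is the population form with a small `eps` added.

```python
from statistics import mean, pstdev

GROUP_SIZE = 8            # rollouts sharing the same task
GROUPS_PER_ROLLOUT = 16   # 8 * 16 = 128 environments in total

def episode_return(success, num_invalid_actions,
                   success_reward=10.0, invalid_penalty=-0.1):
    """Rule-based return; per-action penalty accumulation is an assumption."""
    base = success_reward if success else 0.0
    return base + invalid_penalty * num_invalid_actions

def group_relative_advantages(returns, eps=1e-6):
    """Normalize each return by its group's mean and standard deviation."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in returns]

group = [episode_return(s, k) for s, k in
         [(True, 0), (False, 2), (False, 0), (True, 1),
          (False, 0), (False, 5), (False, 0), (False, 0)]]
print(group_relative_advantages(group))
```

When every rollout in a group receives the same return, all advantages are zero and that group contributes no policy gradient.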

### A.2 Multi-Environment Evaluation Details

The multi-environment experiment uses the AgentGym evaluation code and is separate from the main ALFWorld and WebShop setting. It trains on the full AgentTraj-L train split for TextCraft, SciWorld, WebShop, and ALFWorld, and evaluates on the full AgentTraj-L eval split for each environment. No additional branch transitions are collected in this experiment, because the mixed train split already provides broad transition coverage across environments. For AAWM, the immediate transition set $\mathcal{I}_t$ contains only the original one-step transition at the corresponding step.

The training data contain 2,420 trajectories for ALFWorld, 3,930 for WebShop, 374 for TextCraft, and 2,120 for SciWorld, for a total of 8,844 trajectories. Evaluation uses 200 tasks each for ALFWorld, WebShop, and SciWorld, and 100 tasks for TextCraft. The backbone is Qwen2.5-1.5B-Instruct. All supervised runs use full fine-tuning with learning rate $1\times 10^{-5}$ and effective batch size 32. IWM, IWM-Summary, and AAWM use two epochs of world model fine-tuning followed by one epoch of imitation learning. Base uses only one epoch of imitation learning, without world model fine-tuning. The IWM-Summary baseline summarizes observations in all four environments using Qwen3-30B-A3B-Instruct-2507 before next-observation prediction. Evaluation uses greedy decoding with temperature 0.0. The maximum numbers of interaction rounds are 20 for TextCraft, 30 for SciWorld, 6 for WebShop, and 30 for ALFWorld.

### A.3 Sensitivity to Self-Probing Propositions

[Table 5](https://arxiv.org/html/2606.25421#A1.T5) varies the number $n$ of confirmed patterns and open questions produced at each state. Increasing $n$ from 1 to 3 gives a clear improvement on both environments, showing that a single proposition is insufficient to cover the dynamics needed for action selection. Increasing $n$ from 3 to 5 yields smaller additional gains. We therefore use $n=3$ in all main experiments as a compact setting that captures most of the benefit.

Table 5: Sensitivity to the number of Self-Probing propositions $n$ at Qwen2.5-1.5B-Instruct. Each setting uses the same world model fine-tuning and lightweight imitation learning protocol and is evaluated without GRPO. ALFWorld reports aggregate success rate, and WebShop reports success rate. Cells report mean $\pm$ std over three seeds.

## Appendix B Reinforcement Learning without Imitation Learning

We isolate whether world modeling initialization remains useful when GRPO starts without the lightweight imitation learning stage. [Table 6](https://arxiv.org/html/2606.25421#A2.T6) compares direct GRPO from the backbone with GRPO from the two world modeling initializations at Qwen2.5-1.5B-Instruct.

Without imitation learning, IWM reaches only 13.2% on ALFWorld and 54.2% on WebShop. This is far below direct GRPO from the backbone on ALFWorld and also lower on WebShop. By contrast, AAWM reaches 70.6% on ALFWorld and 63.8% on WebShop. It therefore preserves the direct GRPO performance on ALFWorld and improves WebShop over the base row. This shows that AAWM relies less on imitation learning than next-observation prediction. A likely reason is that AAWM trains on decision-oriented dynamics, so its world modeling target is closer to the information used by the policy during reinforcement learning.

Table 6: Direct reinforcement learning at Qwen2.5-1.5B-Instruct. All rows start GRPO without the lightweight imitation learning stage. Base denotes GRPO directly from the backbone. IWM and AAWM denote GRPO from the corresponding world modeling initialization. ALFWorld reports aggregate success rate, and WebShop reports success rate. Cells report mean $\pm$ std over three seeds.

## Appendix C Additional Training Dynamics

[Figure 4](https://arxiv.org/html/2606.25421#A3.F4) shows the ALFWorld training dynamics under the same imitation learning then reinforcement learning setting as [Figure 3](https://arxiv.org/html/2606.25421#S6.F3). AAWM reaches higher success earlier than Base and IWM, while also maintaining higher response entropy through most of training. This matches the WebShop pattern and shows that the faster improvement of AAWM does not come from early output collapse. Instead, the initialization supports broader and more effective exploration during GRPO.

![Refer to caption](https://arxiv.org/html/2606.25421v1/x4.png)

Figure 4: GRPO training dynamics on ALFWorld under the imitation learning then reinforcement learning setting initialized by world modeling. The backbone is Qwen2.5-1.5B-Instruct. The top panel shows success rate, and the bottom panel shows response entropy. Bold curves show time-weighted EMA smoothing with coefficient 0.95. AAWM reaches higher success earlier while preserving broader exploration.

## Appendix D Prompt Templates

This appendix reproduces the prompt templates used in Self-Probing and Dynamics Synthesis ([Sections 4.1](https://arxiv.org/html/2606.25421#S4.SS1) and [4.3](https://arxiv.org/html/2606.25421#S4.SS3)) and evaluation ([Section 6](https://arxiv.org/html/2606.25421#S6)). Curly-brace tokens (`{...}`) denote runtime variables.

**Self-Probing prompt**

> {decision_context}
>
> You are an agent operating in an interactive environment. Before taking your next action, externalize your understanding — your internal model of how this environment operates.
>
> Output exactly in this format — two categories, each with 2 to 3 statements:
>
> [CONFIRMED PATTERNS]
> Cause-and-effect relationships, requirements, or constraints you believe to be true about how this environment operates. State them as general rules you've inferred, not observations you currently see.
>
> [OPEN QUESTIONS]
> Specific uncertainties about environment mechanics whose answers would affect which action you choose next.
>
> Format: Each statement on its own line. No numbering, no bullet points.
>
> Your statements will be verified and answered, then provided back to you to inform your current decision.
>
> For this environment, output exactly {max_beliefs} confirmed pattern(s) and exactly {max_questions} open question(s). Do not output extra statements.

The variables `max_beliefs` and `max_questions` are set to $n$, the per-state proposition count reported in [Section 5.1](https://arxiv.org/html/2606.25421#S5.SS1).
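Because the prompt fixes a two-section output format with one statement per line, the policy's response can be split into confirmed patterns and open questions with a small parser. The following is a minimal sketch under that assumption; `parse_self_probing` is a hypothetical helper, not part of any released code, and the sample response reuses the propositions from ALFWorld Case 1 in Appendix F.

```python
def parse_self_probing(text, n=None):
    """Split a Self-Probing response into (confirmed patterns, open questions)."""
    headers = ("[CONFIRMED PATTERNS]", "[OPEN QUESTIONS]")
    sections = {h: [] for h in headers}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank separator lines
        if line in sections:
            current = line  # switch to the section named by this header
        elif current is not None:
            sections[current].append(line)
    patterns, questions = sections[headers[0]], sections[headers[1]]
    if n is not None and (len(patterns) != n or len(questions) != n):
        raise ValueError("response does not contain exactly n statements per category")
    return patterns, questions

response = """[CONFIRMED PATTERNS]
I believe cleaning a potato with the sinkbasin 1 will remove dirt from it.

[OPEN QUESTIONS]
I don't know if the sinkbasin 1 can clean the potato without it being on the sink.
"""
patterns, questions = parse_self_probing(response, n=1)
print(patterns, questions)
```

Each returned statement would then serve as one retrieval query, as described in Appendix E.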

**Dynamics Synthesis prompt**

> {decision_context}
>
> — Agent's Current Understanding —
>
> [CONFIRMED PATTERNS]
> {confirmed_patterns}
>
> [OPEN QUESTIONS]
> {open_questions}
>
> — Retrieved Evidence —
>
> {retrieved_transition_evidence}
>
> — Current Step Outcome —
>
> {immediate_transitions}
>
> Internalize the retrieved evidence and the current step outcomes shown above, then perform the following task. Write as if stating knowledge directly — do not reference or cite the evidence records.
>
> Your task:
> 1. Correct any confirmed patterns that appear to be wrong.
> 2. Answer the open questions where possible.
> 3. Synthesize the key environment dynamics most relevant to the current task goal and action selection.
>
> Requirements:
> – Describe cause-effect relationships specific to this environment — not general knowledge any agent would already possess.
> – Focus on dynamics bearing on the current task goal and the situation at hand.
> – DO NOT contradict the facts observed in the current and prior steps.
> – DO NOT include action suggestions or plans.
> – Write a concise, coherent paragraph with no preamble or closing remarks.

The `retrieved_transition_evidence` field groups $K$ retrieved transitions under the proposition that retrieved them:

**Retrieved transition formatting**

> For confirmed pattern 1:
> Record 1:
> State: “{observation_t}”
> Action: “{action_t}”
> Result: “{observation_t1}”
> …
> Record $K$:
> State: “{observation_t}”
> Action: “{action_t}”
> Result: “{observation_t1}”
>
> …
>
> For open question $n$:
> Record 1:
> State: “{observation_t}”
> Action: “{action_t}”
> Result: “{observation_t1}”
> …
> Record $K$:
> State: “{observation_t}”
> Action: “{action_t}”
> Result: “{observation_t1}”
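The grouping above can be rendered programmatically. This sketch assumes each field sits on its own line and uses straight quotes; the exact whitespace and quote characters of the original template are not recoverable from the text, and `format_evidence` and the sample records are illustrative only.

```python
def format_evidence(groups):
    """Render retrieved transitions grouped by the proposition that retrieved them.

    groups: list of (label, records), where label is e.g. "confirmed pattern 1"
    and records is a list of (observation_t, action_t, observation_t1) tuples.
    """
    lines = []
    for label, records in groups:
        lines.append(f"For {label}:")
        for i, (obs, action, next_obs) in enumerate(records, start=1):
            lines.append(f"Record {i}:")
            lines.append(f'State: "{obs}"')
            lines.append(f'Action: "{action}"')
            lines.append(f'Result: "{next_obs}"')
    return "\n".join(lines)

# Hypothetical transition records for one confirmed pattern and one open question.
groups = [
    ("confirmed pattern 1", [
        ("You are holding potato 1 at sinkbasin 1.",
         "clean potato 1 with sinkbasin 1",
         "You clean the potato 1 using the sinkbasin 1."),
        ("You arrive at sinkbasin 1.", "look", "You are facing the sinkbasin 1."),
    ]),
    ("open question 1", [
        ("You arrive at drawer 3.", "open drawer 3", "You open the drawer 3."),
    ]),
]
evidence = format_evidence(groups)
print(evidence)
```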

**Probing effectiveness judge prompt**

> You are judging one proposition about an interactive text environment and one synthesis paragraph.
>
> Input fields:
> – state_context: task and current environment context.
> – type: P or Q.
> – proposition: a single confirmed pattern (P) or open question (Q).
> – synthesis_output: a paragraph written after seeing transition evidence.
>
> Definitions:
> P means the agent states a dynamics pattern it believes to hold.
> Q means the agent asks about a dynamics point it wants resolved.
>
> `<sample>`
> state_context: {state_context}
> type: {type}
> proposition: {proposition}
> synthesis_output: {synthesis_output}
> `</sample>`
>
> Dimensions for judgment:
>
> D1 Decision-Helpfulness: Output 1 if the proposition concerns an environment mechanism, precondition, dependency, constraint, or action effect that could meaningfully inform task completion or current/future action choices. Output 0 if it is only a surface description, tautology, generic observation, formatting detail, or irrelevant fact.
>
> D2 Synthesis Contribution:
> – If type=P, output 1 only when the proposition contains an error, inaccuracy, or over-confident unsupported claim, and synthesis_output explicitly corrects it. Otherwise output 0, including when a correct P is merely restated or extended.
> – If type=Q, output 1 only when synthesis_output gives a definite answer to the open question. Output 0 if it is silent, hedged, partial, or only restates the question.
>
> Instruction:
> Return strict JSON only, with fields in this order. The two score fields must be integers, either 0 or 1:
> `{"rationale":"<=50 words","d1_decision_helpfulness":0, "d2_synthesis_contribution":1}`
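The per-triple JSON verdicts can be reduced to the three fractions reported in Table 4. The sketch below is one plausible aggregation: $\bar{d}_1$ averages over all triples, while $\bar{d}_2^{P}$ and $\bar{d}_2^{Q}$ average the D2 score over confirmed patterns and open questions respectively. The helper name and the toy verdicts are assumptions; only the JSON field names come from the prompt.

```python
import json

def aggregate_judgments(judgments):
    """judgments: list of (type, raw_json) pairs, with type "P" or "Q"."""
    d1_scores = []
    d2_scores = {"P": [], "Q": []}
    for kind, raw in judgments:
        verdict = json.loads(raw)
        d1_scores.append(verdict["d1_decision_helpfulness"])
        d2_scores[kind].append(verdict["d2_synthesis_contribution"])

    def fraction(scores):
        # None signals that no triple of this type was judged.
        return sum(scores) / len(scores) if scores else None

    return {
        "d1": fraction(d1_scores),
        "d2_P": fraction(d2_scores["P"]),
        "d2_Q": fraction(d2_scores["Q"]),
    }

# Toy verdicts, not results from the paper.
toy = [
    ("P", '{"rationale":"r","d1_decision_helpfulness":1,"d2_synthesis_contribution":0}'),
    ("P", '{"rationale":"r","d1_decision_helpfulness":1,"d2_synthesis_contribution":1}'),
    ("Q", '{"rationale":"r","d1_decision_helpfulness":1,"d2_synthesis_contribution":1}'),
    ("Q", '{"rationale":"r","d1_decision_helpfulness":0,"d2_synthesis_contribution":1}'),
]
summary = aggregate_judgments(toy)
print(summary)
```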

## Appendix E Transition Retrieval Details

Each proposition $q \in P_t \cup Q_t$ from Self-Probing acts as an independent retrieval query against the transition pool $\mathcal{T}$ ([Section 4.2](https://arxiv.org/html/2606.25421#S4.SS2)). Transitions are embedded with Qwen3-Embedding-0.6B and indexed with FAISS inner-product search over normalized embeddings. Each indexed unit is a single transition record $(o_t, a_t, o_{t+1})$.

**Per-query MMR.** For each proposition $q$, we first retrieve a recall pool of $R$ candidates by embedding similarity, then apply maximal marginal relevance to select the final $K$ transitions that balance relevance to $q$ with diversity among the selected set. [Algorithm 1](https://arxiv.org/html/2606.25421#alg1) describes the procedure. Each proposition is processed independently; there is no cross-query deduplication, so the same transition may appear under different propositions. When retrieving for a state from trajectory $\tau_i$, transitions originating from the same episode and overlapping step are excluded.

**Algorithm 1** Per-query MMR retrieval (run independently for each proposition).

- 1: **Input:** proposition embedding $q$; transition index $\mathcal{T}$; recall pool size $R$; final hits $K$; trade-off $\lambda$
- 2: $C \leftarrow \text{top-}R \text{ from } \mathcal{T}$ by $\langle q, c\rangle$, excluding same-episode hits
- 3: $S \leftarrow \emptyset$
- 4: **while** $|S| < K$ **do**
  - 5: **for each** $c \in C \setminus S$ **do**
    - 6: $\mathrm{score}(c) \leftarrow \lambda \langle q, c\rangle - (1-\lambda) \max_{s \in S} \langle c, s\rangle$
  - 7: $c^{*} \leftarrow \arg\max_{c} \mathrm{score}(c)$
  - 8: $S \leftarrow S \cup \{c^{*}\}$
- 9: **return** $S$
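Algorithm 1 can be sketched in plain Python as follows. This is an illustrative version, not the authors' implementation: it replaces the FAISS index with a brute-force inner-product scan, treats the redundancy term as 0 while $S$ is empty, and stops early if the recall pool holds fewer than $K$ candidates. The `exclude` callback stands in for the same-episode filter, and the vectors and parameter values in the example are made up.

```python
def dot(u, v):
    """Inner product; equals cosine similarity for unit-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def mmr_retrieve(query, candidates, recall_size, k, lam, exclude=lambda i: False):
    """Return indices of up to k candidates selected by maximal marginal relevance."""
    # Step 2: recall pool C = top-R by similarity to the query, after exclusion.
    pool = [i for i in range(len(candidates)) if not exclude(i)]
    pool.sort(key=lambda i: dot(query, candidates[i]), reverse=True)
    pool = pool[:recall_size]

    selected = []  # Step 3: S starts empty.
    while len(selected) < k and len(selected) < len(pool):  # Step 4
        best, best_score = None, float("-inf")
        for i in pool:  # Step 5: every c in C \ S
            if i in selected:
                continue
            # Redundancy with the selected set; 0 when S is empty (assumption).
            redundancy = max((dot(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            score = lam * dot(query, candidates[i]) - (1 - lam) * redundancy  # Step 6
            if score > best_score:
                best, best_score = i, score  # Step 7: running argmax
        selected.append(best)  # Step 8
    return selected  # Step 9

query = [1.0, 0.0]
# Candidate 1 duplicates candidate 0; candidate 2 is less relevant but different.
candidates = [[1.0, 0.0], [1.0, 0.0], [0.6, 0.8]]
print(mmr_retrieve(query, candidates, recall_size=3, k=2, lam=0.3))
print(mmr_retrieve(query, candidates, recall_size=3, k=2, lam=1.0))
```

With a small $\lambda$ the second pick skips the duplicate in favor of the more diverse candidate, whereas $\lambda = 1$ reduces to plain top-$K$ by relevance.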

**Hyperparameters.** [Table 7](https://arxiv.org/html/2606.25421#A5.T7) lists the retrieval configuration. The recall pool $R$ is larger than the final $K$ to give MMR room to diversify within each proposition's candidate set.

Table 7: Transition retrieval hyperparameters.

## Appendix F Self-Probing and Dynamics Synthesis Pairs

This appendix presents four representative triples produced during AAWM data construction. Each case displays the task, the current observation, one confirmed pattern and one open question elicited by Self-Probing, and the corresponding responses in the synthesis output. Where synthesis sentences unrelated to the displayed propositions are omitted, the omission is marked with an ellipsis.

**ALFWorld Case 1**

- **Task.** Clean some potato and put it in microwave.
- **Observation.** You arrive at sinkbasin 1. On the sinkbasin 1, you see nothing.
- **Confirmed pattern.** “I believe cleaning a potato with the sinkbasin 1 will remove dirt from it.”
- **Open question.** “I don’t know if the sinkbasin 1 can clean the potato without it being on the sink.”
- **Synthesis.** The environment permits cleaning an item while holding it at the sinkbasin without placing it on the sink first, as demonstrated by successful cleaning actions (e.g., `clean potato 1 with sinkbasin 1`) when the sinkbasin’s contents do not include the item being cleaned. The sinkbasin’s current empty state does not block the cleaning action, and the potato remains in inventory during cleaning, resulting in a clean potato ready for microwave use.

**ALFWorld Case 2**

- **Task.** Clean some spoon and put it in drawer.
- **Observation.** You open the drawer 3. The drawer 3 is open. In it, you see nothing.
- **Confirmed pattern.** “I think the `inventory` action reveals items the agent is currently holding.”
- **Open question.** “I’m unsure whether the `examine` action reveals hidden items in a drawer.”
- **Synthesis.** The `examine` action does not reveal hidden items in a drawer, as it only confirms the visible state (e.g., “drawer 4 is open. In it, you see nothing”), which is already known after opening. … The `inventory` action only displays items the agent is currently holding, not environmental objects like spoons on countertops or in drawers, so it cannot indicate the spoon’s location without prior pickup.

**WebShop Case 1**

- **Task.** Find me machine wash men’s t-shirts with long sleeve with color: black, and size: xx-large, and price lower than $70.00.
- **Observation.** Search.
- **Confirmed pattern.** “I think specifying price constraints in the search query will filter results to items under $70.”
- **Open question.** “I’m unsure whether including size and color in the search terms ensures they are applied as filters.”
- **Synthesis.** The search engine does not apply price constraints (e.g., `price < 70.00`) as filters, as results include items exceeding the price limit. Including color and size as attributes (e.g., `color:black,size:xx-large`) in the query string fails to filter results, as non-matching items still appear. …

**WebShop Case 2**

- **Task.** Find me wide leg, slim fit, straight leg, loose fit women’s shorts with color: red, and size: xx-large, and price lower than $40.00.
- **Observation.** Back to Search | < Prev | color | black | blue | dark gray | gray | navy | red | silver | size | small | medium | large | x-large | xx-large | Burband Womens High Waist Bike Shorts Tummy Control Workout Yoga Pants Running Exercise Running Shorts with Side Pockets | Price: $6.82 | Rating: N.A. | Description | Features | Reviews | Buy Now
- **Confirmed pattern.** “I think color and size filters apply to the search results, not individual product pages.”
- **Open question.** “I’m unsure whether changing color/size filters after clicking a product affects the current product’s display.”
- **Synthesis.** The product detail page displays color and size filters that directly modify the current product’s available variants (e.g., clicking “red” selects the red color option for this specific product, not the search results), while “Back to Search” returns to the original filtered search results without altering the applied criteria. This confirms that color/size selections on product detail pages are product-specific, not global search filters, and “Back to Search” resets to the search context with the same filters active.

## Appendix G RL Rollout Reasoning

During reinforcement learning, AAWM-initialized policies (Qwen2.5-7B-Instruct) frequently produce reasoning traces that reference specific environment mechanisms before selecting an action. This appendix presents four rollout excerpts illustrating this pattern, with the complete reasoning trace for each step.

**ALFWorld Case 1**

- **Task.** Put a hot cup in cabinet.
- **Observation.** You arrive at microwave 1. The microwave 1 is closed.
- **Reasoning.** The goal is to put a hot cup in a cabinet, and we have the cup in our inventory. To achieve this, we need to heat the cup in the microwave. The current observation shows the microwave is closed, but the microwave can be used to heat the cup regardless of its state (closed or open) when the cup is in inventory. Heating the cup in the microwave will make it hot, and we can then move it to a cabinet. Since we have the cup and are at the microwave, the immediate next step is to heat the cup.
- **Action.** `heat cup 2 with microwave 1`

**ALFWorld Case 2**

- **Task.** Put a clean spatula in diningtable.
- **Observation.** You arrive at sinkbasin 1. On the sinkbasin 1, you see nothing.
- **Reasoning.** Cleaning the spatula 1 with the sinkbasin 1 is a valid action that directly transforms the spatula into a clean state. The environment validates that this action works as expected (e.g., spatulas are successfully cleaned when executed on a sinkbasin with them in inventory), and the current environment state shows the spatula is in inventory and the sinkbasin is empty (no evidence of prior cleaning attempts or water requiring preconditions), making cleaning the immediate step to achieve the goal.
- **Action.** `clean spatula 1 with sinkbasin 1`

**WebShop Case 1**

- **Task.** Find me gluten free, protein serving pantry staples with flavor name: flax seeds, and size: 4.25 ounce (pack of 12), and price lower than $70.00.
- **Observation.** Back to Search | < Prev | flavor name | country ranch | multi-seeds | original almond | … | flax seeds | honey cinnamon | size | 4.25 ounce (pack of 1) | … | 4.25 ounce (pack of 12) | 7.7 ounce (pack of 1) | Blue Diamond Almonds Nut Thins Gluten Free Cracker Crisps, Hint of Sea Salt, 4.25 Oz Boxes (Pack of 12) | Price: $34.44 | Rating: N.A. | Description | Features | Reviews | Buy Now
- **Reasoning.** The current product listing (Blue Diamond Almonds) already satisfies the size and price criteria (4.25 ounce pack of 12 for $34.44), but the flavor is not “flax seeds” as specified in the goal. The action to correct this is to change the flavor filter to “flax seeds” by clicking on the “flax seeds” option in the flavor menu. This will update the product listing to show only items with the correct flavor, ensuring the goal’s flavor requirement is met.
- **Action.** `click[flax seeds]`

**WebShop Case 2**

- **Task.** Find me loose fit women’s tops, tees & blouses with short sleeve, long sleeve with color: c5-wine, and size: large, and price lower than $40.00.
- **Observation.** Back to Search | < Prev | size | small | medium | large | x-large | xx-large | color | a1-blue | a2-black | a2-wine | … | c5-army green | c5-black | c5-blue | c5-gray | c5-green | c5-purple | c5-red | …
- **Reasoning.** The product “FABIURT Summer Tops for Women” has already been filtered by c5-wine color, so it remains to select the size large. The size option “large” is listed in the available size options and does not conflict with the color choice. Selecting size large will satisfy the goal of finding products in size large while maintaining the color c5-wine.
- **Action.** `click[large]`

## Appendix H Use of Scientific Artifacts

Our experiments use publicly available research artifacts\. ALFWorld, WebShop, and the AgentGym codebase are released under MIT licenses\. The Qwen2\.5\-1\.5B\-Instruct, Qwen2\.5\-7B\-Instruct, Qwen3\-Embedding\-0\.6B, and Qwen3\-30B\-A3B\-Instruct\-2507 models are released under Apache\-2\.0 licenses\. The verl\-agent \(GiGPO\) codebase is also released under Apache\-2\.0\. The AgentTraj\-L dataset does not specify an explicit license on its public dataset page; we therefore use it only for research experiments and do not redistribute the original trajectories or any derivative training targets\. Should we release code, prompts, or trained checkpoints, we will do so under terms compatible with the licenses of the underlying artifacts\.