Policy and World Modeling Co-Training for Language Agents
Summary
This paper introduces PaW, a co-training framework that adds auxiliary world modeling supervision to the policy during on-policy RL, reusing the rollouts RL already collects and leaving the inference paradigm unchanged.
Cached at: 06/02/26, 03:34 PM
Source: https://huggingface.co/papers/2606.02388
Abstract
PaW is a co-training framework that combines policy learning and world modeling, drawing world modeling supervision from on-policy reinforcement learning rollouts without separate simulators, extra training stages, or additional inference-time computation.
Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
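The core observation, that each rollout transition pairs an action with its resulting next observation, can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the `Step` record, the prompt template, the `keep_fraction` and `wm_weight` parameters, and the choice to keep the highest-entropy actions (the abstract does not say which direction the selection runs, nor how the loss weight is adapted to reward). The noise-tolerant WM loss is not modeled at all.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    observation: str       # what the agent saw before acting
    action: str            # text action emitted by the policy
    action_entropy: float  # hypothetical: mean token entropy of the action


# A WM example is (prompt, target next observation, action entropy).
WMExample = Tuple[str, str, float]


def wm_examples(steps: List[Step], final_observation: str) -> List[WMExample]:
    """Turn one rollout into next-observation prediction examples."""
    observations = [s.observation for s in steps] + [final_observation]
    examples = []
    for i, step in enumerate(steps):
        prompt = (
            f"Observation: {step.observation}\n"
            f"Action: {step.action}\n"
            "Next observation:"
        )
        examples.append((prompt, observations[i + 1], step.action_entropy))
    return examples


def select_by_entropy(examples: List[WMExample], keep_fraction: float) -> List[WMExample]:
    """Keep the top fraction of examples ranked by action entropy.

    Assumption: ranking high-to-low is an illustrative choice only.
    """
    if not examples:
        return []
    k = max(1, math.ceil(len(examples) * keep_fraction))
    ranked = sorted(examples, key=lambda e: e[2], reverse=True)
    return ranked[:k]


def co_training_loss(policy_loss: float, wm_loss: float, wm_weight: float) -> float:
    """Combine the RL objective with the auxiliary WM term.

    Assumption: a simple weighted sum; the paper adapts the balance to reward.
    """
    return policy_loss + wm_weight * wm_loss


rollout = [
    Step("room A", "open door", 0.9),
    Step("door open", "go north", 0.2),
    Step("room B", "take key", 0.5),
]
examples = wm_examples(rollout, final_observation="key taken")
selected = select_by_entropy(examples, keep_fraction=0.5)
total = co_training_loss(policy_loss=1.0, wm_loss=2.0, wm_weight=0.1)
```

Because the WM examples are derived from transitions the RL loop has already collected, the sketch needs no separate simulator or extra rollout pass, which is the property the abstract emphasizes.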
Similar Articles
Milestone-Guided Policy Learning for Long-Horizon Language Agents
This paper introduces BEACON, a milestone-guided policy learning framework designed to improve credit assignment and sample efficiency for long-horizon language agents. It demonstrates significant performance improvements over GRPO and GiGPO on benchmarks like ALFWorld, WebShop, and ScienceWorld.
Learning POMDP World Models from Observations with Language-Model Priors
This paper introduces Pinductor, a method that uses language model priors to efficiently learn POMDP world models from limited observation-action data, achieving performance comparable to methods with privileged hidden state access while surpassing traditional tabular approaches.
World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
This paper proposes Privileged-Future On-Policy Self-Distillation (PF-OPSD) for controlled concrete reasoning, combining world models' visual simulation with language models' abstract reasoning to improve prediction accuracy and robustness on two new benchmarks.
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents
Proposes CVT-RL, a constrained policy-gradient algorithm with policy-conditioned counterfactual contribution estimation and verifiable rewards, improving long-horizon language agent reliability and reducing reward hacking.
Learning Agentic Policy from Action Guidance
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.