World Pilot: Steering Vision-Language-Action Models with World-Action Priors
Summary
World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving state-of-the-art zero-shot performance on manipulation tasks.
Cached at: 06/11/26, 01:41 PM
Source: https://huggingface.co/papers/2606.12403
Abstract
World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks.
Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
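The two pathways described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how either prior is fused, so the gated additive fusion in `latent_steering`, the convex blend in `action_steering`, and every name here (`WorldActionPriors`, `toy_action_head`, `steer`, `gate`, `trust`) are assumptions chosen only to show where each prior would enter the decision chain.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]
Trajectory = List[Vector]  # one action vector per future step


@dataclass
class WorldActionPriors:
    """Hypothetical container for what a World-Action Model might hand to the policy."""
    scene_latent: Vector                                  # scene-evolution latent
    anticipated_trajectory: Optional[Trajectory] = None   # None: no motion prior available


def latent_steering(visual_features: Vector, scene_latent: Vector, gate: float = 0.5) -> Vector:
    """Condition perception features on the latent (assumed: gated additive fusion)."""
    if len(visual_features) != len(scene_latent):
        raise ValueError("feature and latent widths must match")
    return [f + gate * z for f, z in zip(visual_features, scene_latent)]


def action_steering(policy_actions: Trajectory, prior: Optional[Trajectory],
                    trust: float = 0.5) -> Trajectory:
    """Blend policy actions with the anticipated trajectory (assumed: convex mix)."""
    if prior is None:
        return policy_actions  # fall back to the policy's own actions
    if len(prior) != len(policy_actions):
        raise ValueError("horizons must match")
    return [
        [trust * p + (1.0 - trust) * a for p, a in zip(prior_step, action_step)]
        for prior_step, action_step in zip(prior, policy_actions)
    ]


def toy_action_head(features: Vector, horizon: int, action_dim: int) -> Trajectory:
    """Stand-in for the VLA action generator: repeats the feature mean at every step."""
    mean = sum(features) / len(features)
    return [[mean] * action_dim for _ in range(horizon)]


def steer(visual_features: Vector, priors: WorldActionPriors,
          horizon: int = 2, action_dim: int = 2) -> Trajectory:
    """Run both pathways: steer perception first, then steer the generated actions."""
    steered = latent_steering(visual_features, priors.scene_latent)
    actions = toy_action_head(steered, horizon, action_dim)
    return action_steering(actions, priors.anticipated_trajectory)


if __name__ == "__main__":
    both = WorldActionPriors([2.0, -2.0], [[1.0, 1.0], [3.0, 3.0]])
    latent_only = WorldActionPriors([2.0, -2.0])
    print(steer([1.0, 2.0], both))
    print(steer([1.0, 2.0], latent_only))
```

Leaving `anticipated_trajectory` as `None` is a loose analogue of the case the abstract mentions, where the scene-evolution prior comes from a video-pretrained world model that has not been action-post-trained: the latent pathway still applies while the motion prior is simply skipped.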
Get this paper in your agent:
hf papers read 2606.12403
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### Chedan86/WorldPilot-LIBERO Robotics • Updated about 12 hours ago • 1
Datasets citing this paper: 1
#### Chedan86/WorldPilot-LIBERO-precompute Updated about 12 hours ago • 841 • 1
Similar Articles
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper introduces World-Language-Action (WLA) models, embodied foundation models that jointly predict textual subtasks, subgoal images, and robot actions from text, images, and robot states, achieving state-of-the-art multi-task and long-horizon learning in simulated and real-world environments.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.
Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators
The paper proposes Astra, an agentic spatial reasoning framework that couples a reinforcement learning-trained VLM policy with a world simulator to generate novel-view observations for improved spatial reasoning in Vision-Language Models.
Learning POMDP World Models from Observations with Language-Model Priors
This paper introduces Pinductor, a method that uses language model priors to efficiently learn POMDP world models from limited observation-action data, achieving performance comparable to methods with privileged hidden state access while surpassing traditional tabular approaches.
The DAWN of World-Action Interactive Models
This paper introduces DAWN, a latent generative baseline for World-Action Interactive Models (WAIMs) that jointly models scene evolution and action generation through recursive refinement, achieving strong long-horizon planning in autonomous driving scenarios.