LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies
Summary
LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving state-of-the-art success rates with up to 24x lower latency than pixel-space world action models.
Source: https://huggingface.co/papers/2606.15768
Abstract
LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency.
Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
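The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name, dimension, and weight below is hypothetical, random linear maps stand in for trained networks, and the way a latent action is chosen at inference time (`propose_latent_action`) is an assumption, since the abstract does not specify it. The sketch only shows the shape of the idea: a forward decoder predicts future features (the latent visual subgoal) in feature space, and the policy conditions on that subgoal rather than on generated video.

```python
import math
import random

random.seed(0)

# All sizes are hypothetical placeholders, not values from the paper.
FEAT_DIM = 16        # frozen vision-encoder feature size
LATENT_ACT_DIM = 4   # latent action size
ACTION_DIM = 7       # e.g. one 7-DoF arm command
CHUNK = 8            # actions per predicted chunk


def random_matrix(rows, cols, scale=0.1):
    """Random weights standing in for a trained network."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Return matrix @ vec for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


W_INVERSE = random_matrix(LATENT_ACT_DIM, 2 * FEAT_DIM)
W_PRIOR = random_matrix(LATENT_ACT_DIM, FEAT_DIM)
W_FORWARD = random_matrix(FEAT_DIM, FEAT_DIM + LATENT_ACT_DIM)
W_POLICY = random_matrix(CHUNK * ACTION_DIM, 2 * FEAT_DIM)


def infer_latent_action(feat_now, feat_future):
    """Training-time encoder: a latent action summarizing the feature change."""
    return [math.tanh(v) for v in matvec(W_INVERSE, feat_now + feat_future)]


def propose_latent_action(feat_now):
    """Assumed inference-time stand-in: guess a latent action without a future frame."""
    return [math.tanh(v) for v in matvec(W_PRIOR, feat_now)]


def forward_decoder(feat_now, latent_action):
    """Predict future observation features (the latent visual subgoal)."""
    delta = matvec(W_FORWARD, feat_now + latent_action)
    return [f + d for f, d in zip(feat_now, delta)]


def predict_action_chunk(feat_now):
    """Condition action generation on the predicted latent subgoal."""
    subgoal = forward_decoder(feat_now, propose_latent_action(feat_now))
    flat = matvec(W_POLICY, feat_now + subgoal)
    chunk = [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(CHUNK)]
    return chunk, subgoal


features = [random.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
actions, subgoal = predict_action_chunk(features)
print(len(actions), len(actions[0]), len(subgoal))  # prints: 8 7 16
```

Note that nothing in this pipeline decodes back to pixels: the subgoal stays a 16-dimensional feature vector here, which is the property the abstract credits for the latency advantage over pixel-space WAMs.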
Similar Articles
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Light-WAM is a lightweight world action model for efficient robot manipulation that uses a compact video backbone and downsampled latent space for future-video supervision, achieving high performance with low inference latency.
Learning Visual Feature-Based World Models via Residual Latent Action
This paper introduces RLA-WM, a visual feature-based world model that leverages residual latent actions and flow matching to efficiently predict future visual states. The method outperforms existing video-diffusion and feature-based approaches while enabling novel robot learning techniques from offline, actionless demonstration videos.
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to decouple world prediction from action execution, achieving efficient long-horizon planning and real-time control. It achieves state-of-the-art performance on robotic manipulation tasks with up to 92.8% success on RoboTwin and 78.3% on real-world tasks, while reaching 24.17 Hz closed-loop control.
The DAWN of World-Action Interactive Models
This paper introduces DAWN, a latent generative baseline for World-Action Interactive Models (WAIMs) that jointly models scene evolution and action generation through recursive refinement, achieving strong long-horizon planning in autonomous driving scenarios.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper introduces World-Language-Action (WLA) models, embodied foundation models that jointly predict textual subtasks, subgoal images, and robot actions from text, images, and robot states, achieving state-of-the-art multi-task and long-horizon learning in simulated and real-world environments.