LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Hugging Face Daily Papers, 06/14/26, 12:06 PM

Summary

LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of generating expensive video, achieving state-of-the-art or competitive success rates with up to 24x lower latency than pixel-space world action models.

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
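The abstract describes a pipeline in which a frozen vision encoder produces features, a latent action model (LAM) is trained in that feature space, and its forward decoder is reused to predict a future-feature subgoal that conditions action-chunk generation. The NumPy sketch below is a minimal, untrained illustration of that data flow under loudly stated assumptions: the pretrained vision foundation model is replaced by a fixed random projection; all sizes (`FEAT_DIM`, `LATENT_ACT_DIM`, `CHUNK_LEN`, `ACTION_DIM`, and so on) are hypothetical; `latent_act_head`, which proposes a latent action from current features plus a language embedding at inference time, is an assumption, because the abstract does not say how the latent action is obtained when the future frame is unavailable; and the action generator is a plain MLP regressor, not necessarily what the paper uses. Gradient training is omitted, and `lam_reconstruction_loss` only shows the shape of the LAM objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's actual dimensions are not stated in the abstract.
IMG_SHAPE = (64, 64, 3)
IMG_PIXELS = 64 * 64 * 3
FEAT_DIM = 128        # stand-in for vision-foundation-model feature size
LATENT_ACT_DIM = 8    # compact latent-action bottleneck
LANG_DIM = 32         # language-embedding size
ACTION_DIM = 7        # e.g. end-effector delta plus gripper (assumption)
CHUNK_LEN = 16        # actions emitted per action-chunk prediction (assumption)
HID = 256


def linear(in_dim, out_dim):
    """Randomly initialised (untrained) weight matrix and zero bias."""
    w = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, out_dim))
    return w, np.zeros(out_dim)


def mlp(params, x):
    """Two-layer MLP with a tanh hidden layer."""
    (w1, b1), (w2, b2) = params
    return np.tanh(x @ w1 + b1) @ w2 + b2


# Frozen encoder: a fixed random projection standing in for a pretrained vision model.
W_ENC = rng.normal(0.0, 1.0 / np.sqrt(IMG_PIXELS), size=(IMG_PIXELS, FEAT_DIM))


def encode(images):
    """Map a batch of images (B, H, W, C) to feature vectors (B, FEAT_DIM)."""
    return images.reshape(images.shape[0], -1) @ W_ENC


# Latent action model: inverse encoder (z_t, z_future) -> latent action,
# forward decoder (z_t, latent action) -> predicted future features.
inverse_enc = [linear(2 * FEAT_DIM, HID), linear(HID, LATENT_ACT_DIM)]
forward_dec = [linear(FEAT_DIM + LATENT_ACT_DIM, HID), linear(HID, FEAT_DIM)]

# Hypothetical policy heads: propose a latent action, then emit an action chunk.
latent_act_head = [linear(FEAT_DIM + LANG_DIM, HID), linear(HID, LATENT_ACT_DIM)]
action_head = [linear(2 * FEAT_DIM + LANG_DIM, HID), linear(HID, CHUNK_LEN * ACTION_DIM)]


def lam_reconstruction_loss(z_t, z_future):
    """LAM objective: reconstruct future features through the latent-action bottleneck."""
    a_lat = mlp(inverse_enc, np.concatenate([z_t, z_future], axis=-1))
    z_pred = mlp(forward_dec, np.concatenate([z_t, a_lat], axis=-1))
    return float(np.mean((z_pred - z_future) ** 2))


def lawam_step(images, lang_emb):
    """Inference: predict a latent visual subgoal, then condition the action chunk on it."""
    z_t = encode(images)
    a_lat = mlp(latent_act_head, np.concatenate([z_t, lang_emb], axis=-1))
    # The forward decoder is reused as a latent world model to predict future features.
    z_subgoal = mlp(forward_dec, np.concatenate([z_t, a_lat], axis=-1))
    flat = mlp(action_head, np.concatenate([z_t, z_subgoal, lang_emb], axis=-1))
    return z_subgoal, flat.reshape(-1, CHUNK_LEN, ACTION_DIM)


# Usage on random data (batch of 2).
obs = rng.random((2, *IMG_SHAPE))
obs_next = rng.random((2, *IMG_SHAPE))
lang = rng.normal(size=(2, LANG_DIM))
loss = lam_reconstruction_loss(encode(obs), encode(obs_next))
subgoal, chunk = lawam_step(obs, lang)
print(subgoal.shape, chunk.shape)  # (2, 128) (2, 16, 7)
```

The point the sketch makes concrete is that the subgoal passed to the action head is a `FEAT_DIM`-sized feature vector rather than a rendered future frame, which is consistent with the abstract's argument that skipping pixel-level video generation is where the latency savings come from.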


Cached at: 06/16/26, 03:32 PM


Source: https://huggingface.co/papers/2606.15768 Authors:



Get this paper in your agent:

hf papers read 2606.15768

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash



Similar Articles

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Learning Visual Feature-Based World Models via Residual Latent Action

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

The DAWN of World-Action Interactive Models

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
