Learning Visual Feature-Based World Models via Residual Latent Action
Summary
This paper introduces RLA-WM, a visual feature-based world model that leverages residual latent actions and flow matching to efficiently predict future visual states. The method outperforms existing video-diffusion and feature-based approaches while enabling novel robot learning techniques from offline, actionless demonstration videos.
Cached at: 05/11/26, 06:55 PM
Source: https://huggingface.co/papers/2605.07079
Abstract
Visual world models predicting future visual features through residual latent action representations achieve superior performance and efficiency compared to existing methods while enabling novel robot learning approaches.
World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
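To make the two ingredients named in the abstract concrete, here is a minimal sketch of a feature residual and of generic flow matching, written in plain Python. It is not the authors' implementation. The page does not describe the RLA encoder, the network, or the exact flow-matching variant, so the sketch makes loud assumptions: the raw difference between consecutive frame features stands in for the learned latent action, the probability path is the common straight-line interpolation between noise and target, and `oracle_velocity` is a hypothetical stand-in for a trained velocity network. All function names and the toy three-dimensional "features" are illustrative only.

```python
import random


def feature_residual(feat_t, feat_next):
    """Element-wise difference between features of consecutive frames."""
    return [b - a for a, b in zip(feat_t, feat_next)]


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the straight path: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(pred_velocity, x0, x1):
    """Mean squared error between a predicted velocity and the target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred_velocity, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins for the visual features of two consecutive frames (assumption).
feat_t = [0.2, -0.5, 1.0]
feat_next = [0.5, -0.1, 0.7]
residual = feature_residual(feat_t, feat_next)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in residual]


def oracle_velocity(x, t):
    # Hypothetical replacement for a trained network: the exact velocity
    # that moves x toward a single known target along the straight path.
    return [(r - xi) / (1.0 - t) for r, xi in zip(residual, x)]


# Sampling: start from noise, integrate the velocity field, recover the residual.
sample = euler_sample(oracle_velocity, noise, steps=8)

# Adding the sampled residual to the current features gives the next features.
predicted_next = [a + r for a, r in zip(feat_t, sample)]
```

In a real model the velocity function would be a network trained by minimizing `flow_matching_loss` at points produced by `interpolate`, conditioned on the current observation. The oracle is used here only so the sketch runs end to end without training.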
Models citing this paper: 1
#### xyzhang368/RLA-WM Robotics • Updated about 5 hours ago
Datasets citing this paper: 1
#### xyzhang368/RLA-WM Updated about 5 hours ago • 37
Similar Articles
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Light-WAM is a lightweight world action model for efficient robot manipulation that uses a compact video backbone and downsampled latent space for future-video supervision, achieving high performance with low inference latency.
LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies
LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving state-of-the-art success rates with up to 24x lower latency than pixel-space world action models.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper introduces World-Language-Action (WLA) models, embodied foundation models that jointly predict textual subtasks, subgoal images, and robot actions from text, images, and robot states, achieving state-of-the-art multi-task and long-horizon learning in simulated and real-world environments.
AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models
Proposes AR-VLA, an autoregressive action expert that generates continuous action sequences with long-term memory for context-aware robotic policy training, improving trajectory smoothness and task success rates over reactive VLA models.
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
RepWAM introduces a world action modeling approach using representation visual-action tokenizers, aiming to learn unified visual and action representations for planning and control.