Learning Visual Feature-Based World Models via Residual Latent Action

Hugging Face Daily Papers 05/08/26, 12:00 AM Papers

Summary

This paper introduces RLA-WM, a visual feature-based world model that leverages residual latent actions and flow matching to efficiently predict future visual states. The method outperforms existing video-diffusion and feature-based approaches while enabling novel robot learning techniques from offline, actionless demonstration videos.

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
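To make the phrase "DINO residuals" concrete, here is a minimal sketch. It assumes a residual is the difference between the visual features of two consecutive frames. The abstract does not spell out the exact definition, so treat this as an illustration rather than the authors' formulation. The names `feature_residuals` and `apply_residual` are hypothetical, and the short float lists stand in for real DINO feature vectors.

```python
from typing import List

# Stand-in for one flattened feature vector per frame.
# Assumption: real inputs would be DINO features, which are not computed here.
Feature = List[float]


def feature_residuals(frames: List[Feature]) -> List[Feature]:
    """Return the per-step residual f[t+1] - f[t] for a sequence of frame features."""
    return [
        [nxt_v - prev_v for prev_v, nxt_v in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def apply_residual(feature: Feature, residual: Feature) -> Feature:
    """Recover the next-frame feature by adding a residual to the current feature."""
    return [f + r for f, r in zip(feature, residual)]


# Toy sequence of three "frames" with two feature dimensions each.
frames = [[0.0, 1.0], [0.5, 1.0], [1.5, 0.0]]
residuals = feature_residuals(frames)
```

Under this assumption, predicting the next visual state reduces to predicting a residual and adding it to the current features, which is what `apply_residual` does.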

Cached at: 05/11/26, 06:55 PM

Source: https://huggingface.co/papers/2605.07079

Abstract

Visual world models predicting future visual features through residual latent action representations achieve superior performance and efficiency compared to existing methods while enabling novel robot learning approaches.

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation only. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces still remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first one is a minimalist world action model with RLA that learns from actionless demonstration videos. The second one is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
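The abstract says RLA-WM predicts RLA values via flow matching but gives no architectural details. The sketch below shows only the generic conditional flow matching recipe with a straight-line path between a noise sample and a target vector. It is not the paper's model. All function names are hypothetical, the "velocity function" is a plain Python callable rather than a neural network, and conditioning on observations is omitted.

```python
from typing import Callable, List

Vec = List[float]
# A velocity function maps (current point, time in [0, 1]) to a velocity vector.
VelocityFn = Callable[[Vec, float], Vec]


def interpolate(x0: Vec, x1: Vec, t: float) -> Vec:
    """Point on the straight path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vec, x1: Vec) -> Vec:
    """Velocity of the straight path, which is constant: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn: VelocityFn, x0: Vec, x1: Vec, t: float) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    pred = velocity_fn(interpolate(x0, x1, t), t)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn: VelocityFn, x0: Vec, steps: int = 10) -> Vec:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


# Toy check with an "oracle" velocity that already knows the target.
noise = [0.0, 0.0]
target = [1.0, -2.0]
oracle: VelocityFn = lambda x, t: target_velocity(noise, target)
loss = flow_matching_loss(oracle, noise, target, t=0.3)
sample = euler_sample(oracle, noise, steps=4)
```

In a learned model, `oracle` would be replaced by a network trained to minimize `flow_matching_loss` over random noise samples and random times. Sampling then needs only a handful of integration steps in feature space, which is consistent with the abstract's speed claim relative to video diffusion, though the actual step count used by RLA-WM is not stated here.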

Get this paper in your agent:

hf papers read 2605.07079

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### xyzhang368/RLA-WM Robotics • Updated about 5 hours ago

Datasets citing this paper: 1

#### xyzhang368/RLA-WM Updated about 5 hours ago • 37

Similar Articles

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
