WALL-WM: Carving World Action Modeling at the Event Joints
Summary
WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.
Source: https://huggingface.co/papers/2606.01955 Published on Jun 1
Submitted by Ruili (https://huggingface.co/RuiliFeng) on Jun 3
Abstract
WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
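The contrast between fixed-length chunks and event-level units, and the idea of cluster-balanced sampling, can be illustrated with a small sketch. This is not the paper's implementation: the abstract gives no algorithmic detail, so the function names, the `(caption, start, end)` event format, and the "pick a cluster uniformly, then an item within it" sampling rule are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict

def fixed_chunks(actions, chunk_len):
    """Chunk-centric view: cut a trajectory into equal-length windows,
    regardless of where semantic events begin or end."""
    return [actions[i:i + chunk_len] for i in range(0, len(actions), chunk_len)]

def event_chunks(actions, events):
    """Event-grounded view: cut the trajectory at event boundaries.
    `events` is a hypothetical list of (caption, start, end), end exclusive."""
    return [(caption, actions[start:end]) for caption, start, end in events]

def cluster_balanced_sample(items, cluster_of, k, rng):
    """Assumed balancing rule: choose a cluster uniformly at random, then an
    item uniformly within it, so small clusters are not drowned out.
    Cluster keys must be sortable (sorting keeps the draw order reproducible)."""
    by_cluster = defaultdict(list)
    for item in items:
        by_cluster[cluster_of(item)].append(item)
    clusters = sorted(by_cluster)
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(k)]

# Stand-in for 10 control steps of a single trajectory.
actions = list(range(10))
# Hypothetical event-level captions with their step ranges.
events = [
    ("reach for cup", 0, 3),
    ("grasp cup", 3, 5),
    ("move cup to shelf", 5, 10),
]

print(fixed_chunks(actions, 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print([(caption, len(steps)) for caption, steps in event_chunks(actions, events)])
# [('reach for cup', 3), ('grasp cup', 2), ('move cup to shelf', 5)]
```

In this toy trajectory the fixed window of 4 steps splits "grasp cup" across two chunks, while the event view yields variable-length units that each carry one caption. The sampler draws from each cluster with equal probability, even when one cluster holds far more items than another.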
Similar Articles
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Light-WAM is a lightweight world action model for efficient robot manipulation that uses a compact video backbone and downsampled latent space for future-video supervision, achieving high performance with low inference latency.
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
RepWAM introduces a world action modeling approach using representation visual-action tokenizers, aiming to learn unified visual and action representations for planning and control.
τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation
τ_0-WM is a unified video-action world model for robotic manipulation that integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone. It shows superior performance on challenging long-horizon and fine-grained tasks.
World Action Models: The Next Frontier in Embodied AI
This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
This paper introduces World-Language-Action (WLA) models, embodied foundation models that jointly predict textual subtasks, subgoal images, and robot actions from text, images, and robot states, achieving state-of-the-art multi-task and long-horizon learning in simulated and real-world environments.