WALL-WM: Carving World Action Modeling at the Event Joints

Hugging Face Daily Papers 06/01/26, 12:00 AM Papers

Summary

WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
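To make the granularity mismatch concrete, here is a minimal sketch contrasting fixed-length chunking with event-level segmentation, followed by one possible reading of cluster-balanced sampling. It is an illustration under stated assumptions, not the authors' implementation. The function names, the `(caption, start, end)` event format, and the two-stage sampling rule are all hypothetical, since the abstract does not specify them.

```python
import random
from collections import defaultdict


def fixed_chunks(actions, chunk_len):
    """Chunk-centric baseline: cut a trajectory into fixed-length windows."""
    return [actions[i:i + chunk_len] for i in range(0, len(actions), chunk_len)]


def event_segments(actions, events):
    """Event-grounded alternative: one variable-length segment per captioned event.

    `events` is a list of (caption, start, end) tuples with `end` exclusive.
    This annotation format is an assumption made for illustration.
    """
    return [(caption, actions[start:end]) for caption, start, end in events]


def cluster_balanced_sample(cluster_ids, k, rng):
    """Draw k item indices: pick a cluster uniformly, then an item inside it.

    Assumed interpretation of "cluster-balanced sampling": rare clusters are
    drawn as often as common ones, regardless of how many items they hold.
    """
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)
    clusters = sorted(by_cluster)
    return [rng.choice(by_cluster[rng.choice(clusters)]) for _ in range(k)]


# Toy trajectory of 10 control steps with three hypothetical event captions.
actions = list(range(10))
events = [
    ("reach for the cup", 0, 3),
    ("grasp the cup", 3, 5),
    ("place the cup on the shelf", 5, 10),
]

chunk_lengths = [len(c) for c in fixed_chunks(actions, 4)]               # [4, 4, 2]
event_lengths = [len(seg) for _, seg in event_segments(actions, events)]  # [3, 2, 5]

# 98 items in cluster "a" and 2 in cluster "b": balanced sampling draws
# from each cluster about half of the time.
cluster_ids = ["a"] * 98 + ["b"] * 2
picks = cluster_balanced_sample(cluster_ids, 1000, random.Random(0))
```

In this toy example the fixed windows cut across the grasp and place events, while the event segments follow the caption boundaries and therefore vary in length, which is the property the event inference mode relies on.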

Original Article


Cached at: 06/03/26, 11:40 PM


Source: https://huggingface.co/papers/2606.01955 Published on Jun 1

Submitted by Ruili (https://huggingface.co/RuiliFeng) on Jun 3







Similar Articles

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

World Action Models: The Next Frontier in Embodied AI

World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis
