WALL-WM: Carving World Action Modeling at the Event Joints

Hugging Face Daily Papers Papers

Summary

WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
Original Article
View Cached Full Text

Cached at: 06/03/26, 11:40 PM

Paper page - WALL-WM: Carving World Action Modeling at the Event Joints

Source: https://huggingface.co/papers/2606.01955 Published on Jun 1

·

Submitted byhttps://huggingface.co/RuiliFeng

Ruilion Jun 3

Authors:

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

Abstract

WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.

WALL-WM is aWorld Action Modelthat shifts video-action learning from chunk-centric optimization to event-groundedVision-Language-Actionpretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimizefixed-length action chunksconditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turnsVLA traininginto short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data aroundsemantic events. Specifically, it pairs event-grounded VLA pretraining with adata ecosystembuilt fromevent-level captionsandcluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enablesvariable-length executionchunks, while theunified modeuses a VLM withStaircase Decodingto condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together withMuon-optimizer-basedlarge-scale pretraininginfrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achievingstate-of-the-art performancein large-scale real-worldgeneralizationevaluation.

View arXiv pageView PDFProject pageGitHub1.04kAdd to collection

Get this paper in your agent:

hf papers read 2606\.01955

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.01955 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.01955 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.01955 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

Hugging Face Daily Papers

τ_0-WM is a unified video-action world model for robotic manipulation that integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone. It shows superior performance on challenging long-horizon and fine-grained tasks.

World Action Models: The Next Frontier in Embodied AI

Hugging Face Daily Papers

This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.