World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Hugging Face Daily Papers

Summary

This paper introduces World-Language-Action (WLA) models, embodied foundation models that jointly predict textual subtasks, subgoal images, and robot actions from text, images, and robot states, achieving state-of-the-art multi-task and long-horizon learning in simulated and real-world environments.

We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the world modeling interface to learn from extensive egocentric videos as in the world-action model (WAM) and the language reasoning capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. At the core of WLA lies an autoregressive (AR) Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the next state, comprising the semantic-level textual intention and complementary fine-grained physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction implicitly impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94% success rate on RoboTwin2.0 Clean and 56.5% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from cross-embodiment robot videos without action annotations.
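
The input/output interface described above can be pictured with a small, purely illustrative Python sketch. Nothing here comes from the WLA-0 code base: the names `WLAInput`, `WLAOutput`, `toy_wla_step`, and `inferences_per_second` are hypothetical, the "model" is a placeholder with no learned components, and the nested-list image format and hold-position action chunk are assumptions made only to keep the example runnable. It shows two things stated in the abstract: world prediction is an optional branch that can be switched off at inference, and a 40 ms latency corresponds to 25 inferences per second.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WLAInput:
    instruction: str                  # textual task instruction
    images: List[List[List[float]]]   # camera frames; nested lists stand in for image tensors
    robot_state: List[float]          # e.g. joint positions (layout is an assumption)


@dataclass
class WLAOutput:
    subtask: str                                 # predicted textual subtask
    actions: List[List[float]]                   # predicted chunk of robot actions
    subgoal_image: Optional[List[List[float]]]   # None when world prediction is disabled


def toy_wla_step(obs: WLAInput, predict_world: bool = False, horizon: int = 4) -> WLAOutput:
    """Hypothetical placeholder for one inference step; no learned model is involved."""
    subtask = f"next step for: {obs.instruction}"
    # Stand-in for the Action Expert: repeat the current state as a hold-position chunk.
    actions = [list(obs.robot_state) for _ in range(horizon)]
    # Stand-in for the World Expert: only evaluated when explicitly enabled.
    subgoal = [row[:] for row in obs.images[-1]] if predict_world else None
    return WLAOutput(subtask=subtask, actions=actions, subgoal_image=subgoal)


def inferences_per_second(latency_ms: float) -> float:
    """Convert a per-inference latency in milliseconds to a rate."""
    return 1000.0 / latency_ms


obs = WLAInput(
    instruction="stack the blocks",
    images=[[[0.0, 0.0], [0.0, 0.0]]],
    robot_state=[0.1, 0.2, 0.3],
)
fast = toy_wla_step(obs)                      # world prediction off
full = toy_wla_step(obs, predict_world=True)  # world prediction on
print(fast.subgoal_image is None, full.subgoal_image is not None)
print(inferences_per_second(40.0))  # 25.0
```

How the actual model uses an activated world prediction for test-time scaling is not specified in the abstract, so the sketch deliberately stops at exposing the toggle.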

Cached at: 06/05/26, 06:07 AM


Source: https://huggingface.co/papers/2606.05979

Abstract

World-language-action models combine textual instruction processing with robot state prediction through an autoregressive transformer backbone, enabling efficient long-horizon task execution and cross-embodiment learning.





Similar Articles

World Action Models: The Next Frontier in Embodied AI

Hugging Face Daily Papers

This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.

Learning Visual Feature-Based World Models via Residual Latent Action

Hugging Face Daily Papers

This paper introduces RLA-WM, a visual feature-based world model that leverages residual latent actions and flow matching to efficiently predict future visual states. The method outperforms existing video-diffusion and feature-based approaches while enabling novel robot learning techniques from offline, actionless demonstration videos.

The DAWN of World-Action Interactive Models

Hugging Face Daily Papers

This paper introduces DAWN, a latent generative baseline for World-Action Interactive Models (WAIMs) that jointly models scene evolution and action generation through recursive refinement, achieving strong long-horizon planning in autonomous driving scenarios.

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

Hugging Face Daily Papers

τ_0-WM is a unified video-action world model for robotic manipulation that integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone. It shows superior performance on challenging long-horizon and fine-grained tasks.