ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Hugging Face Daily Papers 07/01/26, 12:00 AM Papers

Summary

ABot-M0.5 is a new World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency, achieving state-of-the-art results on long-horizon and fine-grained manipulation benchmarks.

Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.
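The abstract describes dream-forcing only at a high level: inverse dynamics is trained "progressively" on model-predicted videos. As a rough illustration of what such a progression could look like, the sketch below ramps the share of training samples that use predicted rather than ground-truth video. The linear ramp, the function names `dream_ratio` and `pick_video_source`, and the `max_ratio` parameter are all assumptions made for illustration; the paper's actual schedule is not given on this page.

```python
import random


def dream_ratio(step: int, total_steps: int, max_ratio: float = 1.0) -> float:
    """Share of inverse-dynamics inputs taken from model-predicted video.

    Hypothetical linear ramp (an assumption, not the paper's schedule):
    0.0 at step 0, rising to max_ratio at total_steps and held there.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_ratio * progress


def pick_video_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Choose the video source for one training sample.

    Returns "predicted" with probability dream_ratio(step, total_steps),
    otherwise "ground_truth".
    """
    if rng.random() < dream_ratio(step, total_steps):
        return "predicted"
    return "ground_truth"
```

Under this assumed schedule, early training sees only ground-truth video, and the inverse-dynamics model is gradually exposed to the same kind of imperfect, model-generated frames it will receive during autoregressive inference.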

Original Article


Cached at: 07/02/26, 03:46 AM


Source: https://huggingface.co/papers/2607.00678 Authors:

Abstract

ABot-M0.5 is a World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency in autoregressive prediction.

Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.
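To make the idea of heterogeneous action subspaces concrete, the sketch below splits a flat mobile-manipulation action vector into a base part and an arm part and hands each to its own expert. This is only a toy view of the routing, not the dual-level Mixture-of-Transformers itself. The 3+7 dimension layout, the names `ACTION_SLICES`, `split_action`, and `route_to_experts`, and the use of plain callables as stand-ins for expert networks are all assumptions; the abstract does not specify any of them.

```python
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical layout (an assumption): 3 base dimensions, e.g. planar
# velocity and yaw rate, followed by 7 arm dimensions.
ACTION_SLICES: Dict[str, slice] = {"base": slice(0, 3), "arm": slice(3, 10)}
ACTION_DIM = 10


def split_action(action: Sequence[float]) -> Dict[str, List[float]]:
    """Split a flat action vector into named subspaces."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} dims, got {len(action)}")
    return {name: list(action[s]) for name, s in ACTION_SLICES.items()}


def route_to_experts(
    action: Sequence[float],
    experts: Dict[str, Callable[[List[float]], Any]],
) -> Dict[str, Any]:
    """Send each action subspace to its own expert.

    Each expert is a plain callable here, standing in for a
    subspace-specific network with its own parameters.
    """
    parts = split_action(action)
    return {name: experts[name](part) for name, part in parts.items()}
```

Keeping base and arm components on separate paths means each expert only has to model one action distribution, which is the kind of conflict the abstract attributes to entangled navigation-manipulation actions.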


Get this paper in your agent:

hf papers read 2607.00678

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

World Action Models: The Next Frontier in Embodied AI

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
