ABot-M0.5: Unified Mobility-and-Manipulation World Action Model
Summary
ABot-M0.5 is a new World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency, achieving state-of-the-art results on long-horizon and fine-grained manipulation benchmarks.
Source: https://huggingface.co/papers/2607.00678
Abstract
ABot-M0.5 is a World Action Model for mobile manipulation that improves performance through temporal granularity alignment, action space disentanglement, and train-test consistency in autoregressive prediction.
Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as a bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and fine-grained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.
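The abstract says only that dream-forcing "progressively" trains inverse dynamics on model-predicted videos and gives no schedule. As a rough illustration of what such a progression could look like, the sketch below ramps the share of model-predicted inputs linearly over training. The linear ramp and the names `dream_forcing_ratio`, `pick_video_source`, and `max_ratio` are all assumptions made for this example, not details from the paper.

```python
import random


def dream_forcing_ratio(step: int, total_steps: int, max_ratio: float = 1.0) -> float:
    """Share of samples whose video input is model-predicted at a given step.

    Assumption: a linear ramp from 0 to max_ratio. The paper's actual
    schedule is not stated in the abstract.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps outside the range stay well-defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_ratio * progress


def pick_video_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Choose which video the inverse-dynamics model is trained on for one sample."""
    if rng.random() < dream_forcing_ratio(step, total_steps):
        return "predicted"      # video latents rolled out by the world model
    return "ground_truth"       # video latents encoded from recorded frames


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 2500, 5000, 7500, 10000):
        ratio = dream_forcing_ratio(step, 10000)
        print(step, f"{ratio:.2f}", pick_video_source(step, 10000, rng))
```

Under this assumed schedule, early training sees only ground-truth videos and late training sees mostly model-predicted ones, which is one way to narrow the gap between training inputs and the inputs encountered during autoregressive inference.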
Similar Articles
τ_0-WM: A Unified Video-Action World Model for Robotic Manipulation
τ_0-WM is a unified video-action world model for robotic manipulation that integrates policy learning, video prediction, and action evaluation using a shared video diffusion backbone. It shows superior performance on challenging long-horizon and fine-grained tasks.
AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to decouple world prediction from action execution, achieving efficient long-horizon planning and real-time control. It achieves state-of-the-art performance on robotic manipulation tasks with up to 92.8% success on RoboTwin and 78.3% on real-world tasks, while reaching 24.17 Hz closed-loop control.
World Action Models: The Next Frontier in Embodied AI
This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.
LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies
LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving state-of-the-art success rates with up to 24x lower latency than pixel-space world action models.
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Light-WAM is a lightweight world action model for efficient robot manipulation that uses a compact video backbone and downsampled latent space for future-video supervision, achieving high performance with low inference latency.