Learning Transferable Dynamics Priors from Action to World Modeling

Hugging Face Daily Papers

Summary

This paper introduces A2World, a diffusion-based world model pretrained on large-scale robot manipulation data to learn transferable dynamics priors. The model can be adapted into a real-world simulator (A2World-sim) for policy evaluation or a video-action prediction model (A2World-policy) for action prediction, demonstrating benefits for both simulator-centric and policy-centric robot learning.

We study action-conditioned world modeling as a scalable way to learn transferable dynamics priors for robot learning. By pretraining a model to predict how actions drive visual scene evolution, the resulting world model captures reusable interaction dynamics beyond appearance-level video generation. Concretely, we pretrain a multi-view interactive base diffusion world model, A2World, on large-scale robot manipulation data with real action annotations. We validate the learned dynamics priors from two complementary perspectives. First, we adapt A2World into a task- or scene-specialized real-world simulator, A2World-sim, whose long-horizon rollouts support simulator-based policy evaluation and scalable what-if analysis by replacing real-robot rollouts with world model rollouts. Second, starting from the same pretrained weights, we adapt A2World into a video-action joint prediction model, A2World-policy, that predicts actions under visual and instruction conditioning. Experiments across simulation benchmarks and real-robot settings demonstrate that action-conditioned world model pretraining yields transferable dynamics priors that benefit both simulator-centric and policy-centric robot learning.

Cached at: 06/30/26, 07:34 AM


Source: https://huggingface.co/papers/2606.29501

Abstract

Action-conditioned world modeling enables transferable dynamics priors for robot learning through pretraining on large-scale manipulation data, supporting both simulator-based policy evaluation and video-action prediction.

We study action-conditioned world modeling as a scalable way to learn transferable dynamics priors for robot learning. By pretraining a model to predict how actions drive visual scene evolution, the resulting world model captures reusable interaction dynamics beyond appearance-level video generation. Concretely, we pretrain a multi-view interactive base diffusion world model, A2World, on large-scale robot manipulation data with real action annotations. We validate the learned dynamics priors from two complementary perspectives. First, we adapt A2World into a task- or scene-specialized real-world simulator, A2World-sim, whose long-horizon rollouts support simulator-based policy evaluation and scalable what-if analysis by replacing real-robot rollouts with world model rollouts. Second, starting from the same pretrained weights, we adapt A2World into a video-action joint prediction model, A2World-policy, that predicts actions under visual and instruction conditioning. Experiments across simulation benchmarks and real-robot settings demonstrate that action-conditioned world model pretraining yields transferable dynamics priors that benefit both simulator-centric and policy-centric robot learning.
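The idea of replacing real-robot rollouts with world model rollouts can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: `toy_world_model` is a hypothetical stand-in for a learned action-conditioned model (here just `state + action` on a small vector, not a diffusion model over multi-view video), `toy_policy` is a hypothetical proportional controller, and `success_rate` is one possible evaluation metric.

```python
from typing import Callable, List, Sequence

State = List[float]   # toy stand-in for an observation (the paper uses multi-view video)
Action = List[float]  # toy stand-in for a robot action


def toy_world_model(state: State, action: Action) -> State:
    """Hypothetical dynamics model: predicts the next state as state + action."""
    return [s + a for s, a in zip(state, action)]


def toy_policy(state: State, goal: State, gain: float = 0.5) -> Action:
    """Hypothetical policy: move a fixed fraction of the way toward the goal."""
    return [gain * (g - s) for s, g in zip(state, goal)]


def rollout(
    world_model: Callable[[State, Action], State],
    policy: Callable[[State, State], Action],
    start: State,
    goal: State,
    horizon: int,
) -> List[State]:
    """Closed-loop rollout: the policy acts, the world model predicts what happens next."""
    state = list(start)
    trajectory = [state]
    for _ in range(horizon):
        action = policy(state, goal)
        state = world_model(state, action)  # no real robot is involved
        trajectory.append(state)
    return trajectory


def success_rate(
    world_model: Callable[[State, Action], State],
    policy: Callable[[State, State], Action],
    starts: Sequence[State],
    goal: State,
    horizon: int,
    tol: float = 1e-2,
) -> float:
    """Fraction of imagined rollouts whose final state lies within tol of the goal."""
    successes = 0
    for start in starts:
        final = rollout(world_model, policy, start, goal, horizon)[-1]
        if max(abs(f - g) for f, g in zip(final, goal)) <= tol:
            successes += 1
    return successes / len(starts)


if __name__ == "__main__":
    starts = [[0.0, 0.0], [1.0, -1.0]]
    goal = [1.0, 1.0]
    print(success_rate(toy_world_model, toy_policy, starts, goal, horizon=10))
```

In this toy setting, changing `horizon`, `gain`, or the start states corresponds to the kind of what-if analysis the abstract mentions: the same policy is re-evaluated under different conditions without collecting new real-world rollouts. How faithful such an evaluation is depends entirely on how well the learned model matches the real dynamics.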




Similar Articles

The DAWN of World-Action Interactive Models

Hugging Face Daily Papers

This paper introduces DAWN, a latent generative baseline for World-Action Interactive Models (WAIMs) that jointly models scene evolution and action generation through recursive refinement, achieving strong long-horizon planning in autonomous driving scenarios.