ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

Hugging Face Daily Papers

Summary

ActWorld proposes a chunk-autoregressive world model with hierarchical action-aware memory to support object interaction alongside navigation, addressing data and memory bottlenecks in existing interactive world models.

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a dataset of 100K interaction videos, each annotated with per-chunk captions generated via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. The project page is available at https://interactwm.github.io/ActWorld.
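
To make the memory idea concrete, here is a minimal toy sketch of importance-routed history compression with a persistent event bank. It is not the authors' implementation: the abstract does not specify the mechanism, so the `Chunk` fields, the importance score, the threshold, the keep budgets, and the uniform subsampling are all assumptions chosen for illustration, and integer lists stand in for latent frames.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Chunk:
    index: int
    frames: list                 # stand-in for the latent frames of one chunk
    importance: float            # hypothetical interaction-importance score in [0, 1]
    event: Optional[str] = None  # e.g. "door opened"; None for pure navigation


def compress(frames, keep):
    """Uniformly subsample `frames` down to at most `keep` entries."""
    if keep >= len(frames):
        return list(frames)
    step = len(frames) / keep
    return [frames[int(i * step)] for i in range(keep)]


@dataclass
class ActionAwareMemory:
    high_keep: int = 4      # assumed budget for interaction-heavy chunks
    low_keep: int = 1       # assumed budget for navigation-only chunks
    threshold: float = 0.5  # assumed routing threshold
    history: list = field(default_factory=list)
    bank: dict = field(default_factory=dict)

    def write(self, chunk):
        # Route by importance rather than recency: important chunks
        # keep more frames no matter how old they are.
        keep = self.high_keep if chunk.importance >= self.threshold else self.low_keep
        self.history.append((chunk.index, compress(chunk.frames, keep)))
        # Events go to a persistent bank that is never compressed away.
        if chunk.event is not None:
            self.bank[chunk.index] = chunk.event

    def context(self):
        """Return what a generator would condition on for the next chunk."""
        return {"history": list(self.history), "events": dict(self.bank)}


mem = ActionAwareMemory()
mem.write(Chunk(0, list(range(0, 8)), importance=0.1))
mem.write(Chunk(1, list(range(8, 16)), importance=0.9, event="door opened"))
mem.write(Chunk(2, list(range(16, 24)), importance=0.2))
print(mem.context())
```

In this sketch the interaction chunk retains four frames and its event label survives in the bank, while the navigation chunks before and after it are each reduced to one frame. A purely recency-biased scheme would instead compress the older interaction chunk hardest, which is the action-forgetting behavior the abstract describes.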

Cached at: 06/17/26, 03:35 AM

Source: https://huggingface.co/papers/2606.17730

Abstract

ActWorld extends navigation-centric interactive world models to support object interaction through a chunk-autoregressive framework with hierarchical action-aware memory and persistent memory banks.


Similar Articles

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Hugging Face Daily Papers

AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to decouple world prediction from action execution, achieving efficient long-horizon planning and real-time control. It achieves state-of-the-art performance on robotic manipulation tasks with up to 92.8% success on RoboTwin and 78.3% on real-world tasks, while reaching 24.17 Hz closed-loop control.

World Action Models: The Next Frontier in Embodied AI

Hugging Face Daily Papers

This survey paper introduces World Action Models (WAMs), a unified framework for embodied AI that integrates predictive state modeling with action generation. It provides a taxonomy of existing methods, analyzes the data ecosystem, and outlines evaluation protocols for this emerging paradigm.

The DAWN of World-Action Interactive Models

Hugging Face Daily Papers

This paper introduces DAWN, a latent generative baseline for World-Action Interactive Models (WAIMs) that jointly models scene evolution and action generation through recursive refinement, achieving strong long-horizon planning in autonomous driving scenarios.

Multi-Agent Transactive Memory

arXiv cs.AI

Proposes Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories to improve task performance and reduce interaction steps in interactive environments like ALFWorld and WebArena.