DreamX-World 1.0: A General-Purpose Interactive World Model

Hugging Face Daily Papers

Summary

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model that supports camera navigation, scene persistence, and promptable events across multiple domains, using novel techniques like E-PRoPE, causal forcing, and memory-conditioned scene persistence to achieve controllable long-horizon generation.

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 (80.79) and LingBot-World (80.45) in overall score.
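
The abstract does not spell out how camera-geometry-based retrieval scores earlier views, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each stored view is summarized by a camera position and a viewing direction, and it ranks stored views by a hypothetical score: the clipped cosine similarity of viewing directions multiplied by an exponential falloff in camera distance. The function name `retrieve_memory_views` and the parameters `k` and `dist_scale` are illustrative assumptions.

```python
import numpy as np


def retrieve_memory_views(query_pos, query_dir, mem_pos, mem_dir, k=2, dist_scale=1.0):
    """Return (indices of the k best-scoring stored views, all scores).

    Hypothetical scoring (an assumption, not taken from the paper):
    score = max(cos(angle between view directions), 0) * exp(-distance / dist_scale)
    """
    query_pos = np.asarray(query_pos, dtype=float)
    query_dir = np.asarray(query_dir, dtype=float)
    mem_pos = np.asarray(mem_pos, dtype=float)
    mem_dir = np.asarray(mem_dir, dtype=float)

    # Normalize directions so the dot product is a cosine similarity.
    query_dir = query_dir / np.linalg.norm(query_dir)
    mem_dir = mem_dir / np.linalg.norm(mem_dir, axis=1, keepdims=True)

    # Views facing away from the query direction get zero weight.
    cos_sim = np.clip(mem_dir @ query_dir, 0.0, None)

    # Nearby cameras are more likely to have seen the same region.
    dist = np.linalg.norm(mem_pos - query_pos, axis=1)
    score = cos_sim * np.exp(-dist / dist_scale)

    # Highest score first; keep the top k indices.
    order = np.argsort(-score)
    return order[:k], score


# Four stored views: two nearby and same-facing, one far away, one facing backward.
mem_pos = [[0, 0, 0], [0, 0, 1], [5, 0, 0], [0, 0, 0]]
mem_dir = [[0, 0, 1], [0, 0, 1], [0, 0, 1], [0, 0, -1]]
idx, score = retrieve_memory_views([0, 0, 0.1], [0, 0, 1], mem_pos, mem_dir, k=2)
print(idx, score)
```

In this toy setup the two nearby, same-facing views rank first, while the distant view and the backward-facing view are suppressed. A real system would more plausibly measure frustum overlap using full intrinsics and extrinsics; the direction-and-distance proxy is used here only to keep the sketch short.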

Cached at: 06/16/26, 11:32 AM

Source: https://huggingface.co/papers/2606.16993

Abstract

DreamX-World 1.0 is an interactive text/image-to-video model that generates long-horizon content with camera control and scene persistence using specialized encoding, training techniques, and optimization methods.

Models citing this paper: 1

GD-ML/DreamX-World-5B (Image-to-Video, 5B parameters)
