DreamX-World 1.0: A General-Purpose Interactive World Model
Summary
DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model that supports camera navigation, scene persistence, and promptable events across multiple domains, using novel techniques like E-PRoPE, causal forcing, and memory-conditioned scene persistence to achieve controllable long-horizon generation.
Cached at: 06/16/26, 11:32 AM
Source: https://huggingface.co/papers/2606.16993
Abstract
DreamX-World 1.0 is an interactive text/image-to-video model that generates long-horizon content with camera control and scene persistence using specialized encoding, training techniques, and optimization methods.
DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE’s projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.
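The abstract does not say how camera-geometry-based retrieval scores earlier views, so the following is only a minimal sketch of the general idea under stated assumptions: each stored view keeps the camera position and viewing direction it was generated from, and a query camera retrieves the stored views whose cameras are nearby and look in a similar direction. The names `MemoryView`, `view_similarity`, `retrieve_views`, and the `distance_scale` parameter are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryView:
    """A previously generated view and its camera (hypothetical structure)."""
    frame_id: int
    position: tuple  # camera centre in world coordinates, (x, y, z)
    forward: tuple   # unit viewing direction, (x, y, z)


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def view_similarity(query, stored, distance_scale=2.0):
    """Heuristic score in [0, 1]; illustrative only, not the paper's rule.

    The score is high when the two cameras are close together and
    point the same way, and zero when they face opposite directions.
    """
    direction_term = max(0.0, _dot(query.forward, stored.forward))
    distance = math.dist(query.position, stored.position)
    distance_term = math.exp(-distance / distance_scale)
    return direction_term * distance_term


def retrieve_views(query, memory, k=2):
    """Return the k stored views with the highest similarity to the query."""
    ranked = sorted(memory, key=lambda v: view_similarity(query, v), reverse=True)
    return ranked[:k]


# Toy memory bank: four earlier views with different cameras.
memory = [
    MemoryView(0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # close, same direction
    MemoryView(1, (5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # far away
    MemoryView(2, (0.5, 0.0, 0.0), (0.0, 0.0, -1.0)),  # close, facing backwards
    MemoryView(3, (0.2, 0.0, 0.1), (0.0, 0.0, 1.0)),   # close, same direction
]
query = MemoryView(99, (0.1, 0.0, 0.0), (0.0, 0.0, 1.0))
ids = [v.frame_id for v in retrieve_views(query, memory, k=2)]
print(ids)
```

In this toy setup the two nearby, same-facing views (frames 0 and 3) are retrieved, while the distant view and the view facing the opposite direction are ranked lower. A real system would more plausibly compare view frustums or projected overlap using full intrinsics and extrinsics; the position-and-direction score here is a stand-in chosen for brevity.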
Get this paper in your agent:
hf papers read 2606.16993
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### GD-ML/DreamX-World-5B
Image-to-Video • 5B
Similar Articles
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 is a multi-modal world model framework that generates high-fidelity 3D Gaussian Splatting scenes from text, images, and videos through specialized modules for panorama generation, trajectory planning, and scene composition, achieving state-of-the-art performance among open-source approaches.
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top benchmarks on EWMBench and DreamGen Bench.
tencent/HY-World-2.0
HY-World 2.0 is Tencent's open-source multi-modal 3D world model that reconstructs and generates 3D worlds from text, images, and videos, producing editable 3D assets (meshes/Gaussian Splatting) comparable to closed-source methods.
ActWorld: From Explorable to Interactive World Model via Action-Aware Memory
ActWorld proposes a chunk-autoregressive world model with hierarchical action-aware memory to support object interaction alongside navigation, addressing data and memory bottlenecks in existing interactive world models.
Starchild-1 by Odyssey
Odyssey released Starchild-1, claiming it as the first real-time multimodal world model.