Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Summary
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, and ranks first overall on EWMBench and DreamGen Bench.
Cached at: 06/16/26, 11:34 AM
Source: https://huggingface.co/papers/2606.17030
Abstract
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an embodied world knowledge corpus.
We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
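The layer-wise joint attention mentioned in the abstract can be illustrated with a minimal, pure-Python sketch. This is not the authors' implementation: it assumes a single attention head, omits normalization, MLP blocks, and diffusion timestep conditioning, and every name in it (`StreamWeights`, `joint_attention_layer`, `random_matrix`) is hypothetical. It shows only the general double-stream idea, in which the text and video streams keep separate projection weights but attend over one concatenated token sequence.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random matrix, standing in for learned weights or token features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


class StreamWeights:
    """Hypothetical per-stream parameters: each stream owns its Q/K/V/output maps."""

    def __init__(self, dim, rng):
        self.wq = random_matrix(dim, dim, rng)
        self.wk = random_matrix(dim, dim, rng)
        self.wv = random_matrix(dim, dim, rng)
        self.wo = random_matrix(dim, dim, rng)


def joint_attention_layer(text_tokens, video_tokens, text_w, video_w):
    """One simplified double-stream layer with joint attention and residuals."""
    # Project each stream with its own weights, then concatenate along the
    # sequence axis (list '+' appends the video rows after the text rows).
    q = matmul(text_tokens, text_w.wq) + matmul(video_tokens, video_w.wq)
    k = matmul(text_tokens, text_w.wk) + matmul(video_tokens, video_w.wk)
    v = matmul(text_tokens, text_w.wv) + matmul(video_tokens, video_w.wv)

    # Scaled dot-product attention over the joint (text + video) sequence.
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    mixed = matmul([softmax(r) for r in scores], v)

    # Split back into streams and apply each stream's own output projection.
    n_text = len(text_tokens)
    text_out = matmul(mixed[:n_text], text_w.wo)
    video_out = matmul(mixed[n_text:], video_w.wo)

    # Residual connections keep each stream's sequence length and width.
    text_new = [[a + b for a, b in zip(r, o)] for r, o in zip(text_tokens, text_out)]
    video_new = [[a + b for a, b in zip(r, o)] for r, o in zip(video_tokens, video_out)]
    return text_new, video_new


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    text_w, video_w = StreamWeights(dim, rng), StreamWeights(dim, rng)
    text = random_matrix(3, dim, rng)    # stand-in for language features
    video = random_matrix(5, dim, rng)   # stand-in for video-VAE latent tokens
    text, video = joint_attention_layer(text, video, text_w, video_w)
    print(len(text), len(video))
```

In a full model, a layer of this kind would be stacked many times (the abstract states 60 layers), so the video latents are conditioned on the language tokens at every depth rather than only at the input.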
Similar Articles
Qwen's Embodied World Modeling (28 minute read)
The Qwen-RobotWorld technical report presents a unified language-conditioned video world model for embodied intelligence, enabling future video prediction from current observations across various domains like robotics, autonomous driving, and navigation, with applications in synthetic data generation, policy evaluation, and planning.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Qwen-VLA is a unified vision-language-action model for embodied decision-making, integrating manipulation, navigation, and trajectory prediction across different robot platforms. It uses a DiT-based action decoder and embodiment-aware prompt conditioning, achieving strong performance and out-of-distribution generalization.
Qwen-Robot Suite: A Foundation Model Suite for Physical World Intelligence
Qwen-Robot Suite is a foundation model suite designed for physical world intelligence, enabling robots to understand and interact with the real world effectively.
Qwen-AgentWorld: Language World Models for General Agents
Qwen-AgentWorld introduces language world models for agentic environments, covering seven domains with long chain-of-thought reasoning. The work includes a new benchmark, AgentWorldBench, and shows that world modeling improves downstream agent performance.
Qwen-Image-2.0 Technical Report
Qwen-Image-2.0 is a new image generation foundation model that unifies high-fidelity synthesis and precise editing using Qwen3-VL and a Multimodal Diffusion Transformer. It excels in text-rich content, multilingual typography, and photorealistic generation.