Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Hugging Face Daily Papers Papers

Summary

Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top benchmarks on EWMBench and DreamGen Bench.

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.
Original Article
View Cached Full Text

Cached at: 06/16/26, 11:34 AM

Paper page - Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Source: https://huggingface.co/papers/2606.17030 Authors:

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

,

Abstract

Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and embodied world knowledge corpus.

We introduce Qwen-RobotWorld, alanguage-conditioned video world modelforembodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT withMLLM Action Encoding, where a 60-layerdouble-stream diffusion transformercouples frozen Qwen2.5-VL semantics withvideo-VAE latentsthroughlayer-wise joint attention; b)Embodied World Knowledge(EWK), an 8.6Mvideo-text corpus(200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+ExpertProgressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

View arXiv pageView PDFProject pageAdd to collection

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.17030 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.17030 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.17030 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

Qwen's Embodied World Modeling (28 minute read)

TLDR AI

The Qwen-RobotWorld technical report presents a unified language-conditioned video world model for embodied intelligence, enabling future video prediction from current observations across various domains like robotics, autonomous driving, and navigation, with applications in synthetic data generation, policy evaluation, and planning.

Qwen-AgentWorld: Language World Models for General Agents

Hacker News Top

Qwen-AgentWorld introduces language world models for agentic environments, covering seven domains with long chain-of-thought reasoning. The work includes a new benchmark, AgentWorldBench, and shows that world modeling improves downstream agent performance.

Qwen-Image-2.0 Technical Report

Hugging Face Daily Papers

Qwen-Image-2.0 is a new image generation foundation model that unifies high-fidelity synthesis and precise editing using Qwen3-VL and a Multimodal Diffusion Transformer. It excels in text-rich content, multilingual typography, and photorealistic generation.