The Qwen-RobotWorld technical report presents a unified, language-conditioned video world model for embodied intelligence. The model predicts future video from current observations across domains such as robotics, autonomous driving, and navigation, with applications in synthetic data generation, policy evaluation, and planning.
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top results on the EWMBench and DreamGen Bench benchmarks.
This paper introduces an approach for mapping unitary operators into the latent space of an LLM, enabling quantum circuit synthesis and language-conditioned specification of gate constraints. It achieves competitive results on Clifford+T circuit synthesis.
AFUN proposes an affordance foundation model that predicts functional masks and 3D motion curves from RGB-D observations and language descriptions, enabling generalizable robot manipulation across diverse environments. The model outperforms baselines on multiple benchmarks and can be deployed for real-world tasks without fine-tuning.