PAIWorld adds geometric awareness and cross-view attention to diffusion-transformer world models, improving multi-view 3D consistency for robotic manipulation tasks and achieving state-of-the-art benchmark results.
WorldOlympiad presents a comprehensive benchmark for evaluating video-based world models across physical faithfulness, geometric consistency, and interaction fidelity, revealing significant gaps in current generative models.
CityRAG introduces a video generative model that produces long, physically grounded, 3D-consistent videos of real-world cities using geo-registered data, enabling realistic navigation and simulation for robotics and autonomous driving.