MultiWorld: Scalable Multi-Agent Multi-View Video World Models
Summary
MultiWorld is a unified framework for multi-agent multi-view video world modeling that achieves accurate control of multiple agents while maintaining multi-view consistency through a Multi-Agent Condition Module and Global State Encoder.
Cached at: 04/21/26, 07:20 AM
Source: https://huggingface.co/papers/2604.18564
Abstract
MultiWorld is a unified framework for multi-agent multi-view world modeling that achieves accurate multi-agent control while maintaining multi-view consistency through specialized modules for condition handling and global state encoding.
Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
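The input/output contract described in the abstract (historical frames plus one action per agent in, the next frame for every view out) can be illustrated with a toy sketch. This is not MultiWorld's architecture: the names `render`, `predict_next`, and `rollout` are hypothetical, the "frames" are rows of integers rather than images, and a hand-written rule stands in for the learned model. It only shows how a shared state can drive several views so that they stay consistent.

```python
from typing import List, Sequence

Frame = List[int]  # toy "frame": a row of cells, 0 = empty, k = agent k


def render(positions: Sequence[int], width: int, view: int) -> Frame:
    """Draw all agents into one view (hypothetical helper).

    Odd-numbered views show the scene mirrored, standing in for a second camera.
    Assumes no two agents occupy the same cell.
    """
    frame = [0] * width
    for agent_id, pos in enumerate(positions, start=1):
        frame[pos] = agent_id
    return frame[::-1] if view % 2 else frame


def predict_next(
    history: List[List[Frame]], actions: Sequence[int], num_views: int
) -> List[Frame]:
    """Predict the next frame of every view from past frames and current actions.

    history[t][v] is the frame of view v at step t; actions holds one move
    (-1, 0, or +1) per agent. A learned model would replace this rule.
    """
    latest = history[-1][0]  # most recent frame of view 0
    width = len(latest)
    # Recover a shared state (agent positions) from the observation.
    positions = [latest.index(agent_id) for agent_id in range(1, len(actions) + 1)]
    # Apply each agent's own action, clamped to the grid.
    moved = [min(width - 1, max(0, p + a)) for p, a in zip(positions, actions)]
    # Every view is rendered from the same state, so the views agree.
    return [render(moved, width, v) for v in range(num_views)]


def rollout(
    initial_frames: List[Frame], action_sequence: Sequence[Sequence[int]]
) -> List[List[Frame]]:
    """Autoregressive rollout: each prediction is appended to the history."""
    history = [initial_frames]
    for actions in action_sequence:
        history.append(predict_next(history, actions, num_views=len(initial_frames)))
    return history[1:]
```

Calling `rollout` with two initial views and a list of per-step action tuples returns one list of per-view frames for each step. Because every view is rendered from the same recovered positions, the two toy views remain mirror images of each other throughout the rollout.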
Similar Articles
Multi-Agent World Models
γ-World is a generative multi-agent world model that supports independently controllable, permutation-symmetric agents using Simplex Rotary Agent Encoding and Sparse Hub Attention, achieving real-time 24 FPS rollouts and zero-shot generalization from two to four players.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM is a full-stack open-source framework that converts bidirectional video diffusion models into real-time interactive video world models with controllable camera, low-latency rollout, and modular architecture.
Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top benchmarks on EWMBench and DreamGen Bench.
Agora-1: The Multi-Agent World Model
Odyssey introduces Agora-1, a multi-agent world model that enables real-time shared simulations for multiple participants, demonstrated with a GoldenEye deathmatch game.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 is a multi-modal world model framework that generates high-fidelity 3D Gaussian Splatting scenes from text, images, and videos through specialized modules for panorama generation, trajectory planning, and scene composition, achieving state-of-the-art performance among open-source approaches.