MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Hugging Face Daily Papers 04/20/26, 12:00 AM Papers

Summary

MultiWorld is a unified framework for multi-agent multi-view video world modeling that achieves accurate control of multiple agents while maintaining multi-view consistency through a Multi-Agent Condition Module and Global State Encoder.

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/

Original Article

Cached at: 04/21/26, 07:20 AM

Source: https://huggingface.co/papers/2604.18564

Abstract

MultiWorld is a unified framework for multi-agent multi-view world modeling that achieves accurate multi-agent control while maintaining multi-view consistency through specialized modules for condition handling and global state encoding.

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
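To make the input/output contract described in the abstract concrete, here is a minimal toy sketch of an action-conditioned rollout with several agents and several views. It is not the paper's architecture: the abstract does not describe the internals of the Multi-Agent Condition Module or the Global State Encoder, so every name below (`render`, `encode_state`, `predict_next`, `rollout`, the "front" and "back" views, the 1-D frame format) is a hypothetical stand-in. A hand-written rule replaces the learned generator; the point is only the data flow: history frames per view plus one action per agent go in, one new frame per view comes out, and all views are rendered from a single shared state so they stay consistent.

```python
from typing import Dict, List

# Assumption: a "frame" is a 1-D row of cells; a cell holds agent_id + 1, or 0 if empty.
Frame = List[int]
WIDTH = 8


def render(positions: Dict[int, int], mirrored: bool) -> Frame:
    """Draw the shared state from one viewpoint (the 'back' view is mirrored)."""
    frame = [0] * WIDTH
    for agent_id, x in positions.items():
        col = WIDTH - 1 - x if mirrored else x
        frame[col] = agent_id + 1
    return frame


def encode_state(frame: Frame, mirrored: bool) -> Dict[int, int]:
    """Recover agent positions from a single frame (toy stand-in for a state encoder)."""
    positions = {}
    for col, value in enumerate(frame):
        if value:
            positions[value - 1] = WIDTH - 1 - col if mirrored else col
    return positions


def predict_next(history: Dict[str, List[Frame]], actions: Dict[int, int]) -> Dict[str, Frame]:
    """One prediction step: history frames per view + one action per agent -> next frame per view.

    actions maps agent id -> step in {-1, 0, +1}. Collisions between agents are
    not handled in this toy; the caller is assumed to avoid them.
    """
    state = encode_state(history["front"][-1], mirrored=False)
    new_state = {
        agent: max(0, min(WIDTH - 1, x + actions.get(agent, 0)))
        for agent, x in state.items()
    }
    # Every view is rendered from the same updated state, so the views cannot disagree.
    return {view: render(new_state, mirrored=(view == "back")) for view in history}


def rollout(history: Dict[str, List[Frame]], action_seq: List[Dict[int, int]]) -> Dict[str, List[Frame]]:
    """Autoregressive rollout: each predicted frame is appended to the history."""
    for actions in action_seq:
        for view, frame in predict_next(history, actions).items():
            history[view].append(frame)
    return history


start = {0: 1, 1: 5}  # agent 0 at x=1, agent 1 at x=5
history = {"front": [render(start, False)], "back": [render(start, True)]}
history = rollout(history, [{0: 1, 1: -1}, {0: 1, 1: 0}])
print(history["front"][-1])
print(history["back"][-1])
```

In this toy, adding an agent means adding a key to the action dictionary and adding a view means adding a key to the history dictionary, which mirrors the kind of flexible agent and view counts the abstract mentions, though only at the interface level.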

Get this paper in your agent:

hf papers read 2604.18564

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Multi-Agent World Models (3 minute read)

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Agora-1: The Multi-Agent World Model

HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
