MultiWorld: Scalable Multi-Agent Multi-View Video World Models

Hugging Face Daily Papers

Summary

MultiWorld is a unified framework for multi-agent multi-view video world modeling that achieves accurate control of multiple agents while maintaining multi-view consistency through a Multi-Agent Condition Module and Global State Encoder.

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/

Cached at: 04/21/26, 07:20 AM


Source: https://huggingface.co/papers/2604.18564

Abstract

MultiWorld is a unified framework for multi-agent multi-view world modeling that achieves accurate multi-agent control while maintaining multi-view consistency through specialized modules for condition handling and global state encoding.

Video world models have achieved remarkable success in simulating environmental dynamics in response to actions by users or agents. They are modeled as action-conditioned video generation models that take historical frames and current actions as input to predict future frames. Yet, most existing approaches are limited to single-agent scenarios and fail to capture the complex interactions inherent in real-world multi-agent systems. We present MultiWorld, a unified framework for multi-agent multi-view world modeling that enables accurate control of multiple agents while maintaining multi-view consistency. We introduce the Multi-Agent Condition Module to achieve precise multi-agent controllability, and the Global State Encoder to ensure coherent observations across different views. MultiWorld supports flexible scaling of agent and view counts, and synthesizes different views in parallel for high efficiency. Experiments on multi-player game environments and multi-robot manipulation tasks demonstrate that MultiWorld outperforms baselines in video fidelity, action-following ability, and multi-view consistency. Project page: https://multi-world.github.io/
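The action-conditioned formulation described above (historical frames plus current actions in, future frames out) can be illustrated with a minimal sketch. This is not the MultiWorld architecture: it does not implement the Multi-Agent Condition Module or the Global State Encoder, and every name in it (`rollout`, `toy_predict_next`, `context_len`, the `Frame` aliases) is a hypothetical placeholder chosen for illustration. The "model" is a toy function that shifts pixel values, standing in for a learned video generator.

```python
from typing import Callable, List, Sequence

# Toy stand-ins (assumptions, not the paper's data format):
Frame = List[float]            # one view's image, flattened to a few "pixels"
MultiViewFrame = List[Frame]   # one Frame per camera view at a single time step
JointAction = List[int]        # one discrete action per agent at a single time step


def rollout(
    predict_next: Callable[[Sequence[MultiViewFrame], JointAction], MultiViewFrame],
    history: List[MultiViewFrame],
    actions: Sequence[JointAction],
    context_len: int = 4,
) -> List[MultiViewFrame]:
    """Autoregressively predict one multi-view frame per joint action."""
    frames = list(history)  # copy so the caller's history is not mutated
    predicted: List[MultiViewFrame] = []
    for joint_action in actions:
        context = frames[-context_len:]          # most recent frames only
        next_frame = predict_next(context, joint_action)
        frames.append(next_frame)                # feed the prediction back in
        predicted.append(next_frame)
    return predicted


def toy_predict_next(
    context: Sequence[MultiViewFrame], joint_action: JointAction
) -> MultiViewFrame:
    """Hypothetical placeholder dynamics: shift every view by the summed actions."""
    last = context[-1]
    delta = float(sum(joint_action))
    # All views are produced from the same joint action in one pass.
    return [[pixel + delta for pixel in view] for view in last]


history = [[[0.0, 0.0], [1.0, 1.0]]]   # 1 past time step, 2 views, 2 pixels each
actions = [[1, 0], [0, -1], [1, 1]]    # 3 time steps, 2 agents
future = rollout(toy_predict_next, history, actions)
```

Here `future` holds three predicted time steps, each with two views. Because the sketch passes a list of per-agent actions and returns a list of per-view frames, changing the number of agents or views only changes the list lengths, which loosely mirrors the flexible agent and view counts the abstract describes.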


Get this paper in your agent:

hf papers read 2604.18564

Don’t have the latest CLI? Install it with: curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Multi-Agent World Models (3 minute read)

TLDR AI

γ-World is a generative multi-agent world model that supports independently controllable, permutation-symmetric agents using Simplex Rotary Agent Encoding and Sparse Hub Attention, achieving real-time 24 FPS rollouts and zero-shot generalization from two to four players.

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Hugging Face Daily Papers

Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top benchmarks on EWMBench and DreamGen Bench.

Agora-1: The Multi-Agent World Model

Hacker News Top

Odyssey introduces Agora-1, a multi-agent world model that enables real-time shared simulations for multiple participants, demonstrated with a GoldenEye deathmatch game.