Holo-World: Unified Camera, Object and Weather Control for Video World Model
Summary
Holo-World presents a unified controllable video world model that generates videos from a single image with explicit control over camera, object motion, and weather. It introduces a novel dataset and techniques to preserve scene structure while transferring to target weather states.
Cached at: 06/20/26, 02:29 PM
Source: https://huggingface.co/papers/2606.20083
Abstract
A unified controllable video world model generates videos from a single image while preserving scene structure and transferring to target weather states through specialized parameterization and conditioning techniques.
Video world models are moving toward preserving an observed world under controllable camera and object motion while allowing its environmental state to change. Yet these controls remain isolated, and weather generation typically relies on a source video or reconstructed scene that already specifies future structure. We study a first-frame-anchored source-to-state setting, where the model starts from a single image and follows explicit camera and object controls and an optional weather instruction, then generates a video that either preserves the source world or transfers it to a target weather state. To address these challenges, we first build HoloStateData, a state video dataset that turns diverse videos into unified control samples for camera, object, and weather supervision. Second, we introduce Holo-World, a unified controllable video world model that jointly controls the scene from a single image. Its Unified Scene Adapter factorizes world preservation and weather transfer into distinct parameter subspaces, using rendered background, geometry buffers, and object controls to maintain controlled scene structure while modeling weather-dependent appearance and particle effects. Additionally, Scene-Weather Decomposed CFG guides scene and weather residuals separately, strengthening target weather effects without over-amplifying the full condition. Quantitative and qualitative experiments demonstrate that Holo-World maintains precise camera and object control with consistent scene structure while transferring scenes into diverse target weather states, outperforming video-to-video weather editing baselines on weather-state generation. Our project page is available at https://xiangchenyin.github.io/Holo-World/.
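The abstract does not give the exact form of Scene-Weather Decomposed CFG, so the following is only a minimal sketch of one plausible reading, modeled on the common two-scale extension of classifier-free guidance. It assumes three noise predictions per denoising step (unconditional, scene-only, and scene plus weather) and two guidance scales. The function name `decomposed_cfg`, its parameters, and the combination rule itself are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def decomposed_cfg(
    eps_uncond: Sequence[float],
    eps_scene: Sequence[float],
    eps_full: Sequence[float],
    scene_scale: float,
    weather_scale: float,
) -> List[float]:
    """Combine three noise predictions using two separate guidance scales.

    Assumed (hypothetical) rule:
        eps = eps_uncond
              + scene_scale   * (eps_scene - eps_uncond)   # scene residual
              + weather_scale * (eps_full  - eps_scene)    # weather residual
    """
    if not (len(eps_uncond) == len(eps_scene) == len(eps_full)):
        raise ValueError("all predictions must have the same length")

    guided = []
    for u, s, f in zip(eps_uncond, eps_scene, eps_full):
        scene_residual = s - u      # effect of adding the scene controls
        weather_residual = f - s    # extra effect of the weather instruction
        guided.append(u + scene_scale * scene_residual
                      + weather_scale * weather_residual)
    return guided


# Toy predictions, flattened to short lists for illustration.
uncond = [0.0, 1.0]
scene = [1.0, 1.0]
full = [1.5, 2.0]

# Raising only the weather scale amplifies the weather residual
# while the scene residual keeps its original weight.
stronger_weather = decomposed_cfg(uncond, scene, full, 1.0, 3.0)
```

Under this assumed rule, setting both scales to the same value `w` collapses to ordinary classifier-free guidance, `eps_uncond + w * (eps_full - eps_uncond)`, which amplifies the full condition at once. Keeping the scales separate is what would allow the weather effect to be strengthened on its own.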
Similar Articles
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
HY-World 2.0 is a multi-modal world model framework that generates high-fidelity 3D Gaussian Splatting scenes from text, images, and videos through specialized modules for panorama generation, trajectory planning, and scene composition, achieving state-of-the-art performance among open-source approaches.
WorldCraft: From Camera Navigation to Object Manipulation in Interactive Video World Models
WorldCraft extends interactive video world models to enable object-level trajectory control while maintaining camera navigation capabilities through specialized control pipelines.
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a unified framework for multi-agent multi-view video world modeling that achieves accurate control of multiple agents while maintaining multi-view consistency through a Multi-Agent Condition Module and Global State Encoder.
minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
minWM is a full-stack open-source framework that converts bidirectional video diffusion models into real-time interactive video world models with controllable camera, low-latency rollout, and modular architecture.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
SANA-WM is a 2.6B-parameter open-source world model that generates high-fidelity 720p minute-scale videos with precise camera control, achieving industrial-level quality while significantly reducing computational requirements.