OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
Summary
A unified framework for camera motion cloning using grid motion videos and multimodal diffusion transformers, enabling director-level control without cross-paired data.
Cached at: 06/15/26, 09:04 AM
Source: https://huggingface.co/papers/2606.13432
Abstract
A unified framework for camera motion cloning that uses grid motion videos as representation and integrates multimodal diffusion transformers for enhanced video generation control.
Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffer from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on a million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/
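The abstract does not say how a grid motion video is constructed, so the following is only a toy illustration of the general idea: a fixed lattice of 3D points is projected through a moving pinhole camera, and the resulting frames make the camera motion visible. Every name here (`project_point`, `render_grid_frame`, `grid_motion_video`), the lattice layout, the yaw-only rotation, and the focal length are assumptions made for this sketch, not details from the paper.

```python
import math


def project_point(point, cam_pos, yaw, focal=100.0, width=64, height=64):
    """Project a world-space point to pixel coordinates with a pinhole camera.

    The camera sits at cam_pos and is rotated by `yaw` radians about the y axis.
    Returns None if the point is behind the camera or outside the frame.
    """
    # Translate into a camera-centred frame.
    x = point[0] - cam_pos[0]
    y = point[1] - cam_pos[1]
    z = point[2] - cam_pos[2]
    # Apply the inverse yaw rotation (world -> camera).
    c, s = math.cos(yaw), math.sin(yaw)
    xc = c * x - s * z
    zc = s * x + c * z
    if zc <= 1e-6:
        return None
    u = int(round(focal * xc / zc + width / 2))
    v = int(round(focal * y / zc + height / 2))
    if 0 <= u < width and 0 <= v < height:
        return u, v
    return None


def render_grid_frame(cam_pos, yaw, width=64, height=64):
    """Render one binary frame of a planar 9x9 lattice placed at z = 5."""
    frame = [[0] * width for _ in range(height)]
    for gx in range(-4, 5):
        for gy in range(-4, 5):
            pix = project_point((gx * 0.5, gy * 0.5, 5.0), cam_pos, yaw,
                                width=width, height=height)
            if pix is not None:
                u, v = pix
                frame[v][u] = 1
    return frame


def grid_motion_video(shots):
    """Concatenate per-shot camera trajectories into one sequence of frames."""
    return [render_grid_frame(pos, yaw) for shot in shots for pos, yaw in shot]


# Two hypothetical shots: a dolly-in followed by a pan to the right.
dolly_in = [((0.0, 0.0, 0.25 * t), 0.0) for t in range(4)]
pan_right = [((0.0, 0.0, 0.0), 0.05 * t) for t in range(4)]
video = grid_motion_video([dolly_in, pan_right])
```

In this sketch a multi-shot sequence is just the concatenation of independent trajectories, which mirrors the abstract's claim that the grid representation supports integrating diverse trajectories. How OmniDirector actually renders, encodes, or conditions on such grids is not specified in the text above.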
Similar Articles
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid is a framework that enables scalable cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, using unpaired data and branch-isolated attention to reduce interference.
Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks
Track2View generates novel camera viewpoints from videos by conditioning a video diffusion transformer on paired 3D point tracks, achieving state-of-the-art visual quality and significant reductions in rotation and translation errors.
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Pantheon360 introduces a 3D-aware 360° video diffusion framework that uses an explicit 3D cache to enforce geometric consistency, enabling high-fidelity digital twin generation from sparse 360° inputs.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R repurposes video diffusion transformers for dense 3D tracking from monocular video, using dual-latent representation and temporal RoPE alignment to achieve state-of-the-art performance with 1.3x faster speed and 4.6x less peak memory than prior methods.
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.