OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
Summary
OmniHumanoid is a framework that enables scalable cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, using unpaired data and branch-isolated attention to reduce interference.
Source: https://huggingface.co/papers/2605.12038
Abstract
OmniHumanoid enables cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, allowing scalable adaptation to new humanoid embodiments using unpaired data.
Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
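The abstract does not spell out how branch-isolated attention is built. One way to picture the idea is an attention mask that keeps the motion-conditioning tokens and the embodiment-specific tokens from reading each other, while the video tokens read from both. The sketch below is an illustration under that assumption only, not the paper's implementation: the token grouping, the function names `branch_isolated_mask` and `masked_attention`, and the masking rule are all hypothetical.

```python
import math


def branch_isolated_mask(n_video, n_motion, n_embod):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed token layout (hypothetical): video tokens first, then
    motion-conditioning tokens, then embodiment-specific tokens.
    """
    n = n_video + n_motion + n_embod

    def branch(i):
        if i < n_video:
            return "video"
        if i < n_video + n_motion:
            return "motion"
        return "embodiment"

    # Video tokens read from every branch; the two conditioning branches
    # only read from themselves, so they never mix directly.
    return [
        [branch(i) == "video" or branch(i) == branch(j) for j in range(n)]
        for i in range(n)
    ]


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention over lists of vectors, honoring the mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j in range(len(k)) if mask[i][j]]
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in allowed]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, allowed)) / total
            for d in range(len(v[0]))
        ])
    return out
```

Under this masking rule, changing the embodiment tokens alters the video-token outputs but leaves the motion-branch outputs untouched, which is the kind of reduced interference the abstract describes.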
Similar Articles
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
OmniPro is the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring 2,700 samples covering diverse tasks and dual-mode evaluation protocols.
LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
LoomVideo introduces a 5B-parameter unified architecture for video generation and editing that reduces computational overhead using novel conditioning mechanisms and multi-modal alignment, achieving competitive performance and faster inference.
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract introduces an end-to-end Diffusion Transformer framework that jointly models RGB appearance and HOI geometry to generate physically-plausible human-object interaction videos with stable hands/faces and zero inference overhead.
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
CogOmniControl is a reasoning-driven framework for controllable video generation that uses a specialized vision-language model (CogVLM) trained on anime production data to infer creative intent from sparse conditions, then guides a diffusion-based generator via reinforcement learning, achieving state-of-the-art results on new benchmarks.