OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

cross-embodiment video-generation motion-transfer humanoid-robotics embodied-ai diffusion-models adaptation

Summary

OmniHumanoid is a framework that enables scalable cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, using unpaired data and branch-isolated attention to reduce interference.

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
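As a rough illustration of the factorization described above, the sketch below keeps one set of shared weights frozen and attaches a small per-embodiment adapter, which is the only part updated when a new embodiment is added. Everything in it is an assumption made for illustration: the names `SharedMotionLayer`, `EmbodimentAdapter`, and `adapt`, the low-rank (LoRA-style) adapter form, and the placeholder feature-statistics loss are not taken from the paper, whose abstract does not specify these details.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, RANK = 8, 4


class SharedMotionLayer:
    """Stand-in for one layer of a shared motion model (hypothetical)."""

    def __init__(self, dim):
        self.weight = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def __call__(self, x):
        return x @ self.weight


class EmbodimentAdapter:
    """Low-rank residual adapter for one embodiment (assumed form)."""

    def __init__(self, dim, rank):
        self.down = rng.standard_normal((dim, rank)) * 0.1  # kept fixed in this sketch
        self.up = np.zeros((rank, dim))  # zero init: the adapter starts as a no-op

    def __call__(self, x):
        return x @ self.down @ self.up


def placeholder_loss(features, target_mean):
    """Toy objective: match the mean feature of unpaired target clips."""
    return 0.5 * np.sum((features.mean(axis=0) - target_mean) ** 2)


def adapt(shared, adapter, clips, target_mean, steps=50, lr=1.0):
    """Gradient descent on adapter.up only; shared weights are never written."""
    for _ in range(steps):
        out = shared(clips) + adapter(clips)
        residual = out.mean(axis=0) - target_mean          # shape (dim,)
        hidden_mean = (clips @ adapter.down).mean(axis=0)  # shape (rank,)
        adapter.up -= lr * np.outer(hidden_mean, residual)


shared = SharedMotionLayer(DIM)
clips = rng.standard_normal((16, DIM))   # stand-in features of unpaired videos
target_mean = rng.standard_normal(DIM)   # stand-in target statistic

adapters = {"new_robot": EmbodimentAdapter(DIM, RANK)}
weight_before = shared.weight.copy()
loss_before = placeholder_loss(shared(clips) + adapters["new_robot"](clips), target_mean)

adapt(shared, adapters["new_robot"], clips, target_mean)
loss_after = placeholder_loss(shared(clips) + adapters["new_robot"](clips), target_mean)

print("shared weights unchanged:", np.array_equal(shared.weight, weight_before))
print("placeholder loss decreased:", loss_after < loss_before)
```

Under these assumptions, supporting another embodiment amounts to adding one more entry to `adapters`, which is the scalability property the abstract claims; the actual adapter design and training objective may differ.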

Original Article

Cached at: 05/18/26, 06:24 AM

Paper page - OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Source: https://huggingface.co/papers/2605.12038

Abstract

OmniHumanoid enables cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, allowing scalable adaptation to new humanoid embodiments using unpaired data.

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
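The abstract does not spell out how branch-isolated attention is built. One possible reading, sketched here purely as an assumption, is that video tokens attend to motion-condition tokens and to embodiment tokens in two separate attention branches, each with its own softmax, rather than in one joint attention over the concatenated context. The function names, token shapes, and the omission of learned projections are all illustrative choices, not the paper's design.

```python
import numpy as np


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)


def cross_attention(queries, context):
    """Single-head attention without learned projections (kept minimal)."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context


def branch_isolated(video, motion, embodiment):
    """Two branches, two softmaxes: neither context competes with the other."""
    motion_out = cross_attention(video, motion)
    embodiment_out = cross_attention(video, embodiment)
    return video + motion_out + embodiment_out, motion_out


def joint(video, motion, embodiment):
    """Baseline for contrast: one softmax over the concatenated context."""
    context = np.concatenate([motion, embodiment], axis=0)
    weights = softmax(video @ context.T / np.sqrt(video.shape[-1]))
    motion_mass = weights[:, : len(motion)].sum(axis=-1)  # attention left for motion
    return video + weights @ context, motion_mass


rng = np.random.default_rng(0)
video = rng.standard_normal((6, 16))     # stand-in video tokens
motion = rng.standard_normal((4, 16))    # stand-in motion-condition tokens
robot_a = rng.standard_normal((3, 16))   # stand-in embodiment tokens, robot A
robot_b = rng.standard_normal((3, 16))   # stand-in embodiment tokens, robot B

_, motion_a = branch_isolated(video, motion, robot_a)
_, motion_b = branch_isolated(video, motion, robot_b)
_, mass_a = joint(video, motion, robot_a)
_, mass_b = joint(video, motion, robot_b)

print("isolated motion branch identical:", np.allclose(motion_a, motion_b))
print("joint motion mass identical:", np.allclose(mass_a, mass_b))
```

In this toy setup the motion branch output is independent of the embodiment tokens by construction, whereas in the joint baseline the embodiment tokens take a share of the softmax mass away from the motion tokens. That is one concrete sense in which isolating the branches could reduce interference between motion conditioning and embodiment-specific modulation.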


Similar Articles

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
