OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation

Hugging Face Daily Papers

Summary

OmniHumanoid is a framework that enables scalable cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, using unpaired data and branch-isolated attention to reduce interference.

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.
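
The abstract does not say how branch-isolated attention is implemented. The sketch below is one possible reading, not the authors' method: it assumes the token sequence is ordered as video tokens, then motion-conditioning tokens, then embodiment tokens, and that isolation means the motion and embodiment branches never attend to each other while video tokens read from both. The names `build_branch_mask` and `masked_attention` and the token layout are hypothetical.

```python
import math


def build_branch_mask(n_video, n_motion, n_embodiment):
    """Return mask[i][j] = True if query token i may attend to key token j.

    Assumed token order: video, then motion-conditioning, then embodiment.
    This layout is an illustrative assumption, not taken from the paper.
    """
    n = n_video + n_motion + n_embodiment

    def branch(i):
        if i < n_video:
            return "video"
        if i < n_video + n_motion:
            return "motion"
        return "embodiment"

    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            # Video tokens see every branch; motion and embodiment tokens
            # only see their own branch, so the two never mix directly.
            row.append(branch(i) == "video" or branch(i) == branch(j))
        mask.append(row)
    return mask


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention over lists of float vectors."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j, kj in enumerate(k)
        ]
        m = max(scores)  # finite: every token may attend to itself
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([
            sum(w / z * vj[c] for w, vj in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


# Two video tokens, one motion token, one embodiment token.
mask = build_branch_mask(2, 1, 1)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
outputs = masked_attention(tokens, tokens, tokens, mask)
```

Under this assumed mask, changing the embodiment token alters the video-token outputs but leaves the motion-token output untouched, which is the kind of reduced interference the abstract describes.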

Cached at: 05/18/26, 06:24 AM

Source: https://huggingface.co/papers/2605.12038

Abstract

OmniHumanoid enables cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, allowing scalable adaptation to new humanoid embodiments using unpaired data.

Cross-embodiment video generation aims to transfer motions across different humanoid embodiments, such as human-to-robot and robot-to-robot, enabling scalable data generation for embodied intelligence. A major challenge in this setting is that motion dynamics are partly transferable across embodiments, whereas appearance and morphology remain embodiment-specific. Existing approaches often entangle these factors, and many require paired data for every target embodiment, which limits scalability to new robots. We present OmniHumanoid, a framework that factorizes transferable motion learning and embodiment-specific adaptation. Our method learns a shared motion transfer model from motion-aligned paired videos spanning multiple embodiments, while adapting to a new embodiment using only unpaired videos through lightweight embodiment-specific adapters. To reduce interference between motion transfer and embodiment adaptation, we further introduce a branch-isolated attention design that separates motion conditioning from embodiment-specific modulation. In addition, we construct a synthetic cross-embodiment dataset with motion-aligned paired videos rendered across diverse humanoid assets, scenes, and viewpoints. Experiments on both synthetic and real-world benchmarks show that OmniHumanoid achieves strong motion fidelity and embodiment consistency, while enabling scalable adaptation to unseen humanoid embodiments without retraining the shared motion model.

Similar Articles

AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Hugging Face Daily Papers

This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Hugging Face Daily Papers

CogOmniControl is a reasoning-driven framework for controllable video generation that uses a specialized vision-language model (CogVLM) trained on anime production data to infer creative intent from sparse conditions, then guides a diffusion-based generator via reinforcement learning, achieving state-of-the-art results on new benchmarks.