AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

human-motion-generation multimodal masked-modeling motion-tokenizer conditional-generation scaling dataset

Summary

This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

Cached at: 06/01/26, 07:18 AM

Source: https://huggingface.co/papers/2605.29488

Abstract

A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations.

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
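The abstract names a Residual FSQ-based tokenizer but gives no details, so the following is only a minimal sketch of the general idea of residual finite scalar quantization, not the authors' implementation. It assumes an odd number of levels per channel, hard clipping to [-1, 1] in place of the tanh bound commonly used with FSQ, and a grid that shrinks by a factor of (levels - 1) at each stage. The names `fsq_quantize`, `residual_fsq`, `levels`, and `num_stages` are hypothetical.

```python
def fsq_quantize(values, levels):
    """Snap each value to one of `levels` evenly spaced points in [-1, 1].

    Assumption: clipping replaces the tanh bound often used in FSQ.
    Returns the quantized values and their integer indices in [0, levels - 1].
    """
    half = (levels - 1) / 2
    quantized, indices = [], []
    for v in values:
        bounded = max(-1.0, min(1.0, v))   # keep the value inside the grid range
        code = round(bounded * half)       # nearest integer grid point
        quantized.append(code / half)
        indices.append(int(code + half))
    return quantized, indices


def residual_fsq(latent, levels=5, num_stages=3):
    """Quantize `latent` in stages; each stage encodes the previous residual.

    Hypothetical sketch: the per-stage scale shrinks by (levels - 1), so the
    worst-case error for inputs in [-1, 1] is 1 / (levels - 1) ** num_stages.
    """
    if levels < 3 or levels % 2 == 0:
        raise ValueError("this sketch assumes an odd number of levels >= 3")
    residual = list(latent)
    recon = [0.0] * len(latent)
    codes = []
    scale = 1.0
    for _ in range(num_stages):
        q, idx = fsq_quantize([r / scale for r in residual], levels)
        q = [x * scale for x in q]                       # back to the original range
        recon = [a + b for a, b in zip(recon, q)]        # accumulate the reconstruction
        residual = [r - x for r, x in zip(residual, q)]  # what is still unexplained
        codes.append(idx)                                # one token stream per stage
        scale /= (levels - 1)                            # finer grid for the next stage
    return recon, codes


if __name__ == "__main__":
    latent = [0.3, -0.82, 0.05, 0.97]
    recon, codes = residual_fsq(latent)
    print("codes per stage:", codes)
    print("max error:", max(abs(a - b) for a, b in zip(latent, recon)))
```

Each stage yields one stream of discrete indices per latent channel; a masked modeling transformer could then be trained to predict masked entries of such streams, although how AnyMo arranges and predicts its tokens is not stated in this abstract.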

Get this paper in your agent:

hf papers read 2605.29488

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### L-yiheng/OmniHuMo Updated about 17 hours ago • 22

Similar Articles

AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild

unsloth/MiMo-V2.5-GGUF · Hugging Face

MolmoMotion: Language-guided 3D motion forecasting

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

MODUS: Decoder-Only Any-to-Any Modeling of Diverse Modalities
