AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling

Hugging Face Daily Papers

Summary

This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.

Cached at: 06/01/26, 07:18 AM

Source: https://huggingface.co/papers/2605.29488

Abstract

A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations.

Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
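The abstract names a Residual FSQ-based tokenizer but gives no implementation details. As a rough illustration of the general idea only, the sketch below applies residual finite scalar quantization to a single latent vector: each stage snaps its input onto a small fixed grid per channel, and the next stage quantizes what is left over at a finer scale. Everything here is an assumption made for illustration, not the authors' method: the function names `fsq_quantize` and `residual_fsq` are hypothetical, inputs are clamped to [-1, 1] instead of coming from a learned encoder with a bounding nonlinearity, the level counts are odd so the grid is symmetric around zero, and training details such as the straight-through gradient are omitted.

```python
def fsq_quantize(z, levels):
    """Snap each channel of z onto a uniform grid of levels[i] points in [-1, 1].

    Assumption: every entry of `levels` is odd, so 0 is always a grid point.
    """
    quantized = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2                 # 5 levels -> integer grid -2..2
        clamped = max(-1.0, min(1.0, value))      # simplification: clamp, no tanh
        quantized.append(round(clamped * half) / half)
    return quantized


def residual_fsq(z, levels, num_stages):
    """Quantize z in several stages; each stage encodes the previous residual.

    Returns (codes, reconstruction), where codes[k] holds the grid values
    chosen at stage k and reconstruction is the sum of all rescaled stages.
    """
    residual = list(z)
    reconstruction = [0.0] * len(z)
    codes = []
    for stage in range(num_stages):
        # After one stage the residual is at most 1 / (levels - 1) per channel,
        # so each later stage shrinks its grid by that factor.
        scales = [float(n - 1) ** (-stage) for n in levels]
        normalized = [r / s for r, s in zip(residual, scales)]
        q = fsq_quantize(normalized, levels)
        codes.append(q)
        step = [v * s for v, s in zip(q, scales)]
        reconstruction = [a + b for a, b in zip(reconstruction, step)]
        residual = [r - b for r, b in zip(residual, step)]
    return codes, reconstruction


if __name__ == "__main__":
    latent = [0.37, -0.82, 0.05]   # made-up latent vector, not real motion data
    for stages in (1, 2, 3):
        _, recon = residual_fsq(latent, [5, 5, 5], stages)
        err = max(abs(a - b) for a, b in zip(latent, recon))
        print(stages, recon, err)
```

In this simplified setting, with inputs already in [-1, 1], each added stage tightens the worst-case per-channel error by a factor of `levels - 1`, which is the usual motivation for stacking residual stages rather than enlarging a single grid.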

Datasets citing this paper: 1

L-yiheng/OmniHuMo
