AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Summary
This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.
Source: https://huggingface.co/papers/2605.29488
Abstract
A unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer to enable high-quality synthesis across arbitrary modality combinations.
Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.
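To make the "Residual FSQ" idea in the abstract concrete, here is a minimal sketch of residual finite scalar quantization on a single latent vector. It is an illustration under stated assumptions, not the paper's tokenizer: the function names (`fsq_stage`, `residual_fsq`), the use of clipping as the bounding function, the odd level counts, and the halving of the scale at each stage are all hypothetical choices made for this example.

```python
# Illustrative sketch only: names, clipping, and the scale schedule are
# assumptions for this example, not details taken from the AnyMo paper.

def fsq_stage(values, levels, scale):
    """Snap each coordinate to one of `levels[i]` evenly spaced grid points
    in [-scale, scale]. Assumes every entry of `levels` is odd."""
    codes, quantized = [], []
    for x, n in zip(values, levels):
        half = (n - 1) // 2                       # grid spans -half..half
        bounded = max(-1.0, min(1.0, x / scale))  # clip to [-1, 1]
        step = round(bounded * half)              # nearest integer grid index
        codes.append(step + half)                 # shift to 0..n-1 token id
        quantized.append(step / half * scale)     # map back to input range
    return codes, quantized


def residual_fsq(values, levels, num_stages=3, shrink=0.5):
    """Quantize `values`, then repeatedly quantize what is left over,
    shrinking the grid range by `shrink` at every stage."""
    residual = list(values)
    reconstruction = [0.0] * len(values)
    all_codes = []
    scale = 1.0
    for _ in range(num_stages):
        codes, quantized = fsq_stage(residual, levels, scale)
        all_codes.append(codes)
        reconstruction = [r + q for r, q in zip(reconstruction, quantized)]
        residual = [r - q for r, q in zip(residual, quantized)]
        scale *= shrink
    return all_codes, reconstruction


latent = [0.83, -0.41, 0.07]
codes, approx = residual_fsq(latent, levels=[5, 5, 5])
errors = [abs(a - b) for a, b in zip(latent, approx)]
print(codes)
print(max(errors))
```

Each stage emits one small integer per coordinate, so a vector becomes a short stack of discrete tokens. With these particular settings and inputs in [-1, 1], every added stage halves the worst-case reconstruction error, which is the usual motivation for stacking residual quantizers.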
Datasets citing this paper: 1
L-yiheng/OmniHuMo • Updated about 17 hours ago • 22
Similar Articles
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
AnyMo is a geometry-aware framework for setup-agnostic human motion modeling using physics-grounded IMU simulation and graph encoding, achieving significant improvements in zero-shot activity recognition, cross-modal retrieval, and motion captioning across multiple datasets.
unsloth/MiMo-V2.5-GGUF · Hugging Face
MiMo-V2.5 is a native omnimodal AI model with strong agentic capabilities, supporting text, image, video, and audio understanding within a unified sparse MoE architecture.
MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
MoCapAnything V2 introduces a fully end-to-end framework for arbitrary-skeleton motion capture from monocular video, jointly optimizing video-to-pose and pose-to-rotation predictions to resolve rotation ambiguity.
OmniHumanoid: Streaming Cross-Embodiment Video Generation with Paired-Free Adaptation
OmniHumanoid is a framework that enables scalable cross-embodiment video generation by factorizing motion transfer and embodiment-specific adaptation, using unpaired data and branch-isolated attention to reduce interference.
LongMoE: Longitudinal Multimodal Learning via Trajectory-Aware Mixture-of-Experts
LongMoE proposes a unified framework that jointly addresses modality missingness and longitudinal dynamics in multimodal clinical learning, using context-aware imputation, attentional tokenization, trajectory-aware encoding, and sparse mixture-of-experts routing. Experiments on ADNI, OASIS-3, and MIMIC-IV demonstrate improved robustness under missing modalities while remaining competitive in full-modality settings.