MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Papers with Code Trending 04/30/26, 12:00 AM Papers

Summary

MoCapAnything V2 introduces a fully end-to-end framework for arbitrary-skeleton motion capture from monocular video, jointly optimizing video-to-pose and pose-to-rotation predictions to resolve rotation ambiguity.

Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
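The claim that joint positions do not fully determine rotations can be checked with a small numeric sketch. The example below is illustrative only and is not taken from the paper: the bone direction, the 90-degree swing, and the 40-degree twist are all assumed values, and the helper names are hypothetical. It builds two rotations that differ only by a twist about the bone axis, shows that they place the child joint at the same position, and measures the geodesic rotation error between them.

```python
import math

def rotation_matrix(axis, angle_deg):
    """Rodrigues' formula: rotation by angle_deg about the (normalized) axis."""
    x, y, z = axis
    n = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / n, y / n, z / n
    a = math.radians(angle_deg)
    c, s, t = math.cos(a), math.sin(a), 1.0 - math.cos(a)
    return [
        [t * x * x + c,     t * x * y - s * z, t * x * z + s * y],
        [t * x * y + s * z, t * y * y + c,     t * y * z - s * x],
        [t * x * z - s * y, t * y * z + s * x, t * z * z + c],
    ]

def matmul(a, b):
    """3x3 matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply(r, v):
    """Rotate 3-vector v by matrix r."""
    return [sum(r[i][k] * v[k] for k in range(3)) for i in range(3)]

def geodesic_deg(r1, r2):
    """Angle in degrees of the relative rotation r1^T r2."""
    trace = sum(r1[k][i] * r2[k][i] for i in range(3) for k in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

# Assumed rest pose: the bone points along +Y from the parent joint.
bone_rest = [0.0, 1.0, 0.0]

swing = rotation_matrix([0.0, 0.0, 1.0], 90.0)  # moves the bone
twist = rotation_matrix(bone_rest, 40.0)        # spins about the bone axis

rot_a = swing                 # swing only
rot_b = matmul(swing, twist)  # twist first, then the same swing

child_a = apply(rot_a, bone_rest)
child_b = apply(rot_b, bone_rest)
error_deg = geodesic_deg(rot_a, rot_b)

print("child position A:", [round(c, 6) for c in child_a])
print("child position B:", [round(c, 6) for c in child_b])
print("rotation difference (deg):", round(error_deg, 3))
```

Both rotations give the same child joint position, yet they differ by the full twist angle. A position-only IK stage has no way to tell them apart, which is the gap the reference pose-rotation pair is meant to close.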


Cached at: 05/08/26, 09:04 AM


Source: https://huggingface.co/papers/2604.28130 Authors:

Abstract

A fully end-to-end framework for arbitrary-skeleton motion capture that jointly optimizes video-to-pose and pose-to-rotation prediction while addressing rotation ambiguity through reference pose-rotation pairs and skeleton-aware attention mechanisms.
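The abstract does not describe the internals of the skeleton-aware attention module, so the following is only a generic sketch of how a skeleton graph can guide attention, not the paper's GL-GMHA. It assumes a hypothetical five-joint skeleton given as parent indices, a "local" pattern that restricts each joint to neighbours within a chosen hop distance, and a "global" pattern with no restriction. All function names and the hop threshold are assumptions made for illustration.

```python
import math

def hop_distances(parents):
    """All-pairs hop distance on a skeleton tree (parent index -1 marks the root)."""
    n = len(parents)
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child].append(parent)
            adj[parent].append(child)
    dist = [[math.inf] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        frontier = [src]
        while frontier:  # breadth-first search from src
            next_frontier = []
            for u in frontier:
                for v in adj[u]:
                    if dist[src][v] == math.inf:
                        dist[src][v] = dist[src][u] + 1
                        next_frontier.append(v)
            frontier = next_frontier
    return dist

def masked_attention_weights(scores, dist, max_hops=None):
    """Row-wise softmax; if max_hops is set, joints farther away get zero weight."""
    weights = []
    for i, row in enumerate(scores):
        allowed = [max_hops is None or dist[i][j] <= max_hops
                   for j in range(len(row))]
        peak = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical skeleton: joint 0 is the root, 1 hangs off 0,
# 2 and 3 hang off 1, and 4 hangs off 3.
parents = [-1, 0, 1, 1, 3]
dist = hop_distances(parents)

# Uniform scores keep the effect of the mask easy to read.
scores = [[0.0] * len(parents) for _ in parents]

local_weights = masked_attention_weights(scores, dist, max_hops=1)
global_weights = masked_attention_weights(scores, dist)

print("local, joint 0: ", local_weights[0])
print("global, joint 0:", global_weights[0])
```

With the local mask, the root attends only to itself and its direct child, while the global pattern spreads attention over every joint. A real module would combine such patterns across learned attention heads.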



Similar Articles

DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

MolmoAct 2

SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning

MolmoAct2: Action Reasoning Models for Real-world Deployment
