MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Summary
MoCapAnything V2 introduces a fully end-to-end framework for arbitrary-skeleton motion capture from monocular video, jointly optimizing video-to-pose and pose-to-rotation predictions to resolve rotation ambiguity.
Cached at: 05/08/26, 09:04 AM
Source: https://huggingface.co/papers/2604.28130
Abstract
A fully end-to-end framework for arbitrary-skeleton motion capture that jointly optimizes video-to-pose and pose-to-rotation prediction while addressing rotation ambiguity through reference pose-rotation pairs and skeleton-aware attention mechanisms.
Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
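The bone-axis twist ambiguity mentioned in the abstract can be illustrated with a small, self-contained sketch. This is not the paper's method or code: the bone layout, the variable names, and the helper `rotate_about_axis` are hypothetical and chosen only for illustration. It uses Rodrigues' rotation formula to show that twisting a bone about its own axis leaves the child joint position unchanged while still changing the bone's orientation, which is why joint positions alone cannot pin down joint rotations.

```python
import math

def rotate_about_axis(v, axis, angle):
    """Rotate 3-vector v about a unit axis by angle (radians), via Rodrigues' formula."""
    ax, ay, az = axis
    vx, vy, vz = v
    c, s = math.cos(angle), math.sin(angle)
    dot = ax * vx + ay * vy + az * vz
    # cross = axis x v
    cross = (ay * vz - az * vy, az * vx - ax * vz, ax * vy - ay * vx)
    return tuple(
        vi * c + ci * s + ai * dot * (1.0 - c)
        for vi, ci, ai in zip(v, cross, axis)
    )

# Hypothetical bone: parent joint at the origin, child joint 1 unit along +Y.
bone_axis = (0.0, 1.0, 0.0)
child_offset = (0.0, 1.0, 0.0)
# A direction attached to the bone but perpendicular to it
# (for example, the direction a palm faces on a forearm bone).
side_vector = (1.0, 0.0, 0.0)

# Apply a 90 degree twist about the bone's own axis.
twist = math.radians(90.0)
child_after = rotate_about_axis(child_offset, bone_axis, twist)
side_after = rotate_about_axis(side_vector, bone_axis, twist)

print("child joint before/after twist:", child_offset, child_after)
print("side direction before/after twist:", side_vector, side_after)
```

In this toy setup the child joint stays where it was, up to floating-point error, while the perpendicular direction turns from +X to -Z. A position-only observation therefore cannot distinguish the two orientations; some additional source of information is needed, which is the role the abstract assigns to the reference pose-rotation pair and rest pose.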
Similar Articles
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
AnyMo is a geometry-aware framework for setup-agnostic human motion modeling using physics-grounded IMU simulation and graph encoding, achieving significant improvements in zero-shot activity recognition, cross-modal retrieval, and motion captioning across multiple datasets.
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
This paper introduces AnyMo, a unified multimodal framework for human motion generation that combines a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, along with the OmniHuMo dataset of over 5,000 hours of motion data to enable high-quality synthesis under arbitrary modality combinations.
@axichuhai: This free and open-source 3D motion capture tool, freemocap, has garnered 9K stars on GitHub. No professional capture equipment needed, just a few ordinary cameras. It transforms multi-view geometry problems into computer vision tasks, using spatial calibration algorithms + deep learning models to extract precise 3D human skeleton data from 2D footage of multiple ordinary cameras…
Freemocap is a free and open-source 3D motion capture tool. It requires only ordinary cameras to reconstruct precise 3D human skeleton data using spatial calibration and deep learning models, supporting multiple export formats.
MolmoMotion: Language-guided 3D motion forecasting
MolmoMotion is a new language-guided 3D motion forecasting model that predicts future 3D point trajectories from video frames and action descriptions, achieving stronger performance than existing methods. Alongside the model, a large dataset (MolmoMotion-1M) and a benchmark (PointMotionBench) are released.
@andrew_n_carr: The term "markerless" gets thrown around a lot in motion capture. What does it mean? Well...surprise! There are still m…
Explains that 'markerless' motion capture still uses markers but estimates them via webcam instead of expensive equipment, enabling multi-person, scale-aware capture.