MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
Summary
MoCapAnything V2 introduces a fully end-to-end framework for arbitrary-skeleton motion capture from monocular video, jointly optimizing video-to-pose and pose-to-rotation predictions to resolve rotation ambiguity.
Source: https://huggingface.co/papers/2604.28130
Abstract
A fully end-to-end framework for arbitrary-skeleton motion capture that jointly optimizes video-to-pose and pose-to-rotation prediction while addressing rotation ambiguity through reference pose-rotation pairs and skeleton-aware attention mechanisms.
Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/
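To make the conditioning idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of a differentiable pose-to-rotation head in the spirit of the abstract: per-joint tokens are built from the predicted joint positions, the rest pose, and a reference pose-rotation pair from the target asset, then refined by attention that mixes a skeleton-graph-masked local term with an unmasked global term, loosely mirroring the GL-GMHA idea. All class names, dimensions, and the 6D rotation output are assumptions made for illustration.

```python
# Hypothetical sketch, not the authors' implementation: a reference-conditioned,
# differentiable pose-to-rotation head with skeleton-graph-guided attention.
import torch
import torch.nn as nn


class SkeletonGraphAttention(nn.Module):
    """Mixes adjacency-masked (local) and unmasked (global) multi-head attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, adjacency: torch.Tensor) -> torch.Tensor:
        # adjacency: (J, J) bool skeleton graph, True = joints are connected.
        # MultiheadAttention's attn_mask uses True to BLOCK attention, so invert it.
        local_mask = ~adjacency
        local, _ = self.local_attn(x, x, x, attn_mask=local_mask)
        global_, _ = self.global_attn(x, x, x)
        return self.norm(x + local + global_)


class PoseToRotationHead(nn.Module):
    """Predicts per-joint rotations (6D representation) from joint positions,
    conditioned on the rest pose and a reference pose-rotation pair."""

    def __init__(self, dim: int = 256, num_layers: int = 4):
        super().__init__()
        # 3 (predicted pos) + 3 (rest pos) + 3 (reference pos) + 6 (reference rot)
        self.embed = nn.Linear(3 + 3 + 3 + 6, dim)
        self.layers = nn.ModuleList(
            [SkeletonGraphAttention(dim) for _ in range(num_layers)]
        )
        self.out = nn.Linear(dim, 6)  # one 6D rotation per joint

    def forward(self, pred_pos, rest_pos, ref_pos, ref_rot6d, adjacency):
        # Position tensors: (B, J, 3); ref_rot6d: (B, J, 6); adjacency: (J, J) bool.
        tokens = self.embed(
            torch.cat([pred_pos, rest_pos, ref_pos, ref_rot6d], dim=-1)
        )
        for layer in self.layers:
            tokens = layer(tokens, adjacency)
        return self.out(tokens)  # (B, J, 6)


# Toy usage with a 24-joint skeleton (identity adjacency stands in for the real
# bone graph; self-loops keep every attention row unmasked).
B, J = 2, 24
adjacency = torch.eye(J, dtype=torch.bool)
head = PoseToRotationHead()
rot6d = head(torch.randn(B, J, 3), torch.randn(B, J, 3),
             torch.randn(B, J, 3), torch.randn(B, J, 6), adjacency)
print(rot6d.shape)  # torch.Size([2, 24, 6])
```

Because every step in such a head is differentiable, a rotation loss on its output can back-propagate into the upstream video-to-pose network, which is the property the end-to-end formulation relies on and which an analytical IK stage cannot provide.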
Similar Articles
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
MolmoAct 2
MolmoAct 2 is an open robotics model that reasons in 3D space before taking actions, developed by the Allen Institute for Artificial Intelligence.
SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning
Meta AI releases SAM 3.1, an update to the Segment Anything Model that enhances real-time video detection and tracking through multiplexing and global reasoning capabilities.
MolmoAct2: Action Reasoning Models for Real-world Deployment
Allen AI releases MolmoAct2, an open-weight Vision-Language-Action model designed for real-world robotic deployment, featuring new datasets, an open action tokenizer, and adaptive reasoning to reduce latency.