EgoForce: Forearm-Guided Camera-Space 3D Hand Pose from a Monocular Egocentric Camera
Summary
EgoForce is a monocular 3D hand reconstruction framework that uses a unified network with differentiable forearm representation, arm-hand transformers, and ray space solvers to recover absolute hand pose and position across different camera models, achieving state-of-the-art accuracy on egocentric benchmarks.
Cached at: 05/13/26, 08:14 PM
Source: https://huggingface.co/papers/2605.12498
Abstract
EgoForce is a monocular 3D hand reconstruction framework that uses a unified network to recover robust, absolute hand pose and position across different camera models through differentiable forearm representation, arm-hand transformers, and ray space solvers.
Reconstructing the absolute 3D pose and shape of the hands from the user’s viewpoint using a single head-mounted camera is crucial for practical egocentric interaction in AR/VR, telepresence, and hand-centric manipulation tasks, where sensing must remain compact and unobtrusive. While monocular RGB methods have made progress, they remain constrained by depth-scale ambiguity and struggle to generalize across the diverse optical configurations of head-mounted devices. As a result, models typically require extensive training on device-specific datasets, which are costly and laborious to acquire. This paper addresses these challenges by introducing EgoForce, a monocular 3D hand reconstruction framework that recovers robust, absolute 3D hand pose and position from the user’s (camera-space) viewpoint. EgoForce operates across fisheye, perspective, and distorted wide-FOV camera models using a single unified network. Our approach combines a differentiable forearm representation that stabilizes hand pose, a unified arm-hand transformer that predicts both hand and forearm geometry from a single egocentric view, mitigating depth-scale ambiguity, and a ray space closed-form solver that enables absolute 3D pose recovery across diverse head-mounted camera models. Experiments on three egocentric benchmarks show that EgoForce achieves state-of-the-art 3D accuracy, reducing camera-space MPJPE by up to 28% on the HOT3D dataset compared to prior methods and maintaining consistent performance across camera configurations. For more details, visit the project page at https://dfki-av.github.io/EgoForce.
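The abstract does not spell out the ray space closed-form solver, so the following is only a generic illustration of the idea, not the authors' formulation. It assumes that root-relative 3D joints and one viewing ray per joint are already available (the rays could come from unprojecting 2D keypoints with any camera model, fisheye or perspective), and it solves a linear least-squares problem for the camera-space translation that places each joint as close as possible to its ray. The function names `solve_translation_from_rays` and `mpjpe` are hypothetical, and MPJPE is computed here as the usual mean Euclidean distance per joint.

```python
import numpy as np


def solve_translation_from_rays(joints_rel, rays):
    """Closed-form translation t minimising the squared distance of
    (joints_rel[i] + t) to the ray through the camera origin along rays[i].

    Illustrative assumption: this is a generic least-squares formulation,
    not the solver described in the paper.
    """
    joints_rel = np.asarray(joints_rel, dtype=np.float64)  # (N, 3)
    rays = np.asarray(rays, dtype=np.float64)              # (N, 3)
    rays = rays / np.linalg.norm(rays, axis=1, keepdims=True)

    # P_i = I - d_i d_i^T removes the component along ray i, leaving the
    # offset perpendicular to the ray.
    proj = np.eye(3)[None] - rays[:, :, None] * rays[:, None, :]  # (N, 3, 3)

    # Normal equations: (sum_i P_i) t = -sum_i P_i x_i
    A = proj.sum(axis=0)
    b = -(proj @ joints_rel[:, :, None]).sum(axis=0)[:, 0]
    return np.linalg.solve(A, b)


def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.linalg.norm(pred - gt, axis=1).mean())


# Synthetic check: 21 joints of about hand size, placed 40 cm from the camera.
rng = np.random.default_rng(0)
joints_rel = rng.normal(scale=0.05, size=(21, 3))
t_true = np.array([0.05, -0.02, 0.40])
joints_cam = joints_rel + t_true
rays = joints_cam / np.linalg.norm(joints_cam, axis=1, keepdims=True)

t_est = solve_translation_from_rays(joints_rel, rays)
error = mpjpe(joints_rel + t_est, joints_cam)
```

Because the constraint is written on rays rather than on pixel coordinates, the same solve applies to any camera model that can map a pixel to a viewing direction. With noise-free rays the recovered translation matches the true one up to numerical precision; with noisy rays or an inaccurate hand scale the depth estimate degrades, which is the depth-scale ambiguity the abstract refers to.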
Similar Articles
MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons
MoCapAnything V2 introduces a fully end-to-end framework for arbitrary-skeleton motion capture from monocular video, jointly optimizing video-to-pose and pose-to-rotation predictions to resolve rotation ambiguity.
FaithfulFaces: Pose-Faithful Facial Identity Preservation for Text-to-Video Generation
FaithfulFaces is a new framework for text-to-video generation that preserves facial identity consistency across varying poses and occlusions using pose-shared alignment and Euler angle embeddings.
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
DeVI introduces a framework that turns text-conditioned synthetic videos into physically plausible dexterous robot control via a hybrid 3D-2D tracking reward, enabling zero-shot generalization to unseen objects.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
Parameter-Efficient Multi-View Proficiency Estimation: From Discriminative Classification to Generative Feedback
This paper introduces three parameter-efficient methods for multi-view proficiency estimation on the Ego-Exo4D dataset, shifting from discriminative classification to generative feedback. The proposed models achieve state-of-the-art accuracy with significantly fewer parameters and training epochs than video-transformer baselines.