TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
Summary
TrackCraft3R repurposes video diffusion transformers for dense 3D tracking from monocular video. It uses a dual-latent representation and temporal RoPE alignment to achieve state-of-the-art performance while running 1.3x faster and using 4.6x less peak memory than the strongest prior method.
Cached at: 05/14/26, 08:18 AM
Source: https://huggingface.co/papers/2605.12587
Published on May 12
Submitted by frog (https://huggingface.co/frog123123123123) on May 14
Abstract
TrackCraft3R enables efficient dense 3D tracking from monocular video by adapting video diffusion transformers to follow physical points across frames, using a dual-latent representation and temporal RoPE alignment.
Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame’s content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3x faster and using 4.6x less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.
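The temporal RoPE alignment idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the abstract only states that alignment specifies the target timestamp of each track latent, so the layout below (the track latent for time t shares temporal index t with the geometry latent of frame t) is an assumption. All function names (`rope_angles`, `apply_rope`, `temporal_positions`, `dot`) are hypothetical, and standard 1D RoPE with base 10000 is assumed.

```python
import math

def rope_angles(position, dim, base=10000.0):
    """Rotation angles of standard 1D RoPE for one position (dim must be even)."""
    return [position / (base ** (2 * i / dim)) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by the RoPE angles."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out += [x * c - y * s, x * s + y * c]
    return out

def temporal_positions(num_frames):
    """Assumed layout: both latent streams use the frame index as temporal position."""
    geometry = list(range(num_frames))  # frame-anchored: geometry of frame t
    track = list(range(num_frames))     # reference-anchored: frame-0 pixels at time t
    return geometry, track

def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))

# Toy query (track latent) and key (geometry latent) features.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.9]
geometry_pos, track_pos = temporal_positions(4)

# Same temporal index: the rotations cancel, so the logit equals the raw dot product.
aligned = dot(apply_rope(q, track_pos[2]), apply_rope(k, geometry_pos[2]))
# Different temporal indices: the logit is modulated by the relative offset.
misaligned = dot(apply_rope(q, track_pos[3]), apply_rope(k, geometry_pos[0]))
```

Because RoPE attention logits depend only on relative position, a track query and a geometry key that share a temporal index see zero temporal offset. Under this assumed layout, that is one way a track latent could be tied to the frame whose geometry it should read.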
Similar Articles
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Pantheon360 introduces a 3D-aware 360° video diffusion framework that uses an explicit 3D cache to enforce geometric consistency, enabling high-fidelity digital twin generation from sparse 360° inputs.
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth.
Geometric Context Transformer for Streaming 3D Reconstruction
Introduces LingBot-Map, a feed-forward 3D foundation model for streaming 3D reconstruction using a geometric context transformer architecture that achieves stable real-time performance at 20 FPS.