Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks
Summary
Track2View generates novel camera viewpoints from videos by conditioning a video diffusion transformer on paired 3D point tracks, achieving state-of-the-art visual quality and significant reductions in rotation and translation errors.
Cached at: 06/16/26, 07:33 PM
Source: https://huggingface.co/papers/2606.15534
Abstract
Track2View generates novel camera viewpoints from videos by using 3D point tracks to establish explicit spatiotemporal correspondences, achieving superior visual quality and camera accuracy compared to existing methods.
Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. The project page is available at https://qjizhi.github.io/track2view
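To make the idea of paired 3D point tracks concrete, the following is a minimal sketch, not the authors' implementation. It assumes that world-space 3D tracks are already available, that both cameras follow a simple pinhole model with shared intrinsics, and that extrinsics use the world-to-camera convention x_cam = R x_world + t. The names project_point and paired_tracks are hypothetical and do not come from the paper.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
Camera = Tuple[Mat3, Vec3]            # (R, t), world-to-camera (assumed convention)
Pixel = Optional[Tuple[float, float]]  # None when the point is behind the camera


def project_point(point: Vec3, rotation: Mat3, translation: Vec3,
                  focal: float, cx: float, cy: float) -> Pixel:
    """Project one world-space point with a pinhole camera."""
    # Transform into camera coordinates: x_cam = R @ x_world + t.
    cam = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0.0:
        return None  # not visible: at or behind the image plane
    # Perspective division followed by the intrinsics (focal length, principal point).
    return (focal * cam[0] / cam[2] + cx, focal * cam[1] / cam[2] + cy)


def paired_tracks(tracks_3d: Sequence[Sequence[Vec3]],
                  source_cams: Sequence[Camera],
                  target_cams: Sequence[Camera],
                  focal: float, cx: float, cy: float
                  ) -> List[Tuple[List[Pixel], List[Pixel]]]:
    """Project each 3D track into the source and target views, frame by frame.

    tracks_3d[n][t] is the world position of scene point n at frame t;
    source_cams[t] and target_cams[t] are the two camera poses at frame t.
    """
    pairs = []
    for track in tracks_3d:
        src = [project_point(p, *source_cams[t], focal, cx, cy)
               for t, p in enumerate(track)]
        tgt = [project_point(p, *target_cams[t], focal, cx, cy)
               for t, p in enumerate(track)]
        pairs.append((src, tgt))
    return pairs


# Toy example: one point drifting right over two frames, a static source camera
# at the origin, and a target camera whose center is shifted +1 along x.
IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
track = [(0.0, 0.0, 4.0), (0.5, 0.0, 4.0)]
source = [(IDENTITY, (0.0, 0.0, 0.0))] * 2
target = [(IDENTITY, (-1.0, 0.0, 0.0))] * 2  # t = -R @ camera_center
src_2d, tgt_2d = paired_tracks([track], source, target, 100.0, 50.0, 50.0)[0]
print(src_2d)  # [(50.0, 50.0), (62.5, 50.0)]
print(tgt_2d)  # [(25.0, 50.0), (37.5, 50.0)]
```

Because the same scene point indexes both 2D trajectories at every frame, each pair gives a source pixel and the target pixel where that content should reappear, which is the kind of explicit, temporally continuous correspondence the abstract describes. How Track2View encodes these pairs inside the dual-view track conditioner is not specified here and is not modeled by this sketch.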
Similar Articles
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R repurposes video diffusion transformers for dense 3D tracking from monocular video, using dual-latent representation and temporal RoPE alignment to achieve state-of-the-art performance with 1.3x faster speed and 4.6x less peak memory than prior methods.
Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion
Pantheon360 introduces a 3D-aware 360° video diffusion framework that uses an explicit 3D cache to enforce geometric consistency, enabling high-fidelity digital twin generation from sparse 360° inputs.
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth.
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
A unified framework for camera motion cloning using grid motion videos and multimodal diffusion transformers, enabling director-level control without cross-paired data.
AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
AnyRecon proposes a scalable framework for 3D reconstruction from arbitrary sparse inputs using a video diffusion model with persistent scene memory and geometry-aware conditioning.