Go-with-the-Track: Video Compositing and Motion Control with Point Tracking
Summary
Go-with-the-Track unifies motion control and reference image compositing in video generation using point-track embeddings with spatial-aware encoding and video diffusion transformers, achieving superior motion and reference control in a single model.
Cached at: 06/23/26, 05:43 PM
Source: https://huggingface.co/papers/2606.20891
Abstract
Go-with-the-Track unifies motion control and reference image compositing in video generation by using point-track embeddings with spatial-aware encoding and video diffusion transformers.
Filmmaking demands precise motion control and reference image compositing -- capabilities that existing methods treat separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatial-temporal control over how reference content integrates across frames. We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks -- extending conventional point-tracks to explicitly establish correspondences between generated frames and reference images, thus enabling precise compositing and motion control throughout the video. To achieve this, we introduce spatially-aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track (serving as a unique identifier), while the embedding similarity correlates directly with spatial proximity, enhancing the model’s ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial motion detail loss inherent in naive point-track subsampling. We use a hybrid training strategy to train jointly on dynamic, static, and synthetic scene video datasets to boost motion controllability. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference conditioned video generation with point-track driven compositing, as well as camera control for both static and dynamic scenes. Project Page: https://eyeline-labs.github.io/Go-with-the-Track/
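To make the embedding step in the abstract concrete, here is a minimal NumPy sketch of one possible reading of "coordinate-wise MLP followed by temporal pooling". It is not the authors' implementation. The abstract gives no architectural details, so everything here is an assumption made for illustration: the function names init_mlp and embed_tracks, the layer sizes, the tanh activation, the random (untrained) weights, and the use of mean pooling over time.

```python
import numpy as np

def init_mlp(rng, in_dim=2, hidden=32, out_dim=16):
    """Random weights for a two-layer MLP (sizes are hypothetical)."""
    return {
        "w1": rng.normal(0.0, 0.5, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.5, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def embed_tracks(tracks, params):
    """Map tracks of shape (N, T, 2) to one embedding per track, shape (N, D).

    The same MLP is applied to the (x, y) coordinate at every time step,
    then the per-frame features are averaged over time (assumed pooling).
    """
    hidden = np.tanh(tracks @ params["w1"] + params["b1"])  # (N, T, hidden)
    per_frame = hidden @ params["w2"] + params["b2"]        # (N, T, D)
    return per_frame.mean(axis=1)                           # (N, D)

rng = np.random.default_rng(0)
params = init_mlp(rng)

# Three toy tracks over 8 frames, coordinates normalized to roughly [0, 1]:
# a base track, a close neighbor, and a track shifted far away.
base = rng.uniform(0.0, 1.0, size=(1, 8, 2))
tracks = np.concatenate([base, base + 0.01, base + 0.5], axis=0)

emb = embed_tracks(tracks, params)
d_near = np.linalg.norm(emb[0] - emb[1])  # distance to the close neighbor
d_far = np.linalg.norm(emb[0] - emb[2])   # distance to the far track
print(emb.shape, d_near < d_far)
```

In this toy setup, nearby tracks land closer together in embedding space only because the MLP is a continuous function. The abstract's claim that embedding similarity correlates with spatial proximity concerns the trained model and is not demonstrated by this sketch.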
Get this paper in your agent:
hf papers read 2606.20891
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks
Track2View generates novel camera viewpoints from videos by conditioning a video diffusion transformer on paired 3D point tracks, achieving state-of-the-art visual quality and significant reductions in rotation and translation errors.
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
TrackCraft3R repurposes video diffusion transformers for dense 3D tracking from monocular video, using dual-latent representation and temporal RoPE alignment to achieve state-of-the-art performance with 1.3x faster speed and 4.6x less peak memory than prior methods.
LooseControlVideo: Directorial Video Control using Spatial Blocking
LooseControlVideo introduces a framework for intuitive 3D spatial control in text-to-video generation using sparse oriented 3D boxes as proxies, achieving superior trajectory accuracy and occlusion handling. It fine-tunes a Wan 2.2 backbone and demonstrates significant improvements over existing methods on multiple benchmarks.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that uses vision-language reasoning to refine trajectories and a confidence-aware control scheme to improve plausibility, outperforming existing approaches on a new benchmark.
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
CogOmniControl is a reasoning-driven framework for controllable video generation that uses a specialized vision-language model (CogVLM) trained on anime production data to infer creative intent from sparse conditions, then guides a diffusion-based generator via reinforcement learning, achieving state-of-the-art results on new benchmarks.