Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence
Summary
This paper introduces a post-training framework that leverages 3D priors from SAM3D to improve semantic correspondence in 2D foundation features, addressing issues like left-right confusion and repeated parts. The method uses instance-specific 3D reconstruction without pose annotations or spherical geometry shortcuts.
Cached at: 05/29/26, 11:01 AM
Source: https://huggingface.co/papers/2605.30093
🧵 New work: 3D-Aware Semantic Correspondence
2D foundation features (DINO, Stable Diffusion) are powerful for semantic correspondence, but they have a blind spot: they can’t tell left from right, or distinguish repeated parts that are clearly separate in 3D. We introduce a post-training framework that brings in 3D priors to fix this.
How it works: Given an image, we use SAM3D to reconstruct object geometry and estimate pose, then refine the estimate with render-and-compare. PartField descriptors are rendered into the image plane and combined with DINO + SD features. Geodesic distances on the reconstructed shape filter out unreliable matches, and the filtered correspondences supervise a lightweight adapter.
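The geodesic filtering step can be illustrated with a small sketch. The summary does not state the exact criterion, so this is only one plausible reading: a candidate match is kept when the surface point chosen by the 2D feature match lies within a geodesic threshold of the point suggested by the 3D-aware descriptor. The names `build_graph`, `geodesic_from`, `filter_matches`, and `max_geodesic` are hypothetical, and geodesics are approximated with Dijkstra over mesh edges rather than an exact surface solver.

```python
import heapq
import math


def build_graph(vertices, faces):
    """Adjacency map of the mesh, weighted by Euclidean edge length."""
    graph = {i: {} for i in range(len(vertices))}
    for face in faces:
        # Walk the edges of each polygon: (a, b), (b, c), (c, a).
        for a, b in zip(face, face[1:] + face[:1]):
            w = math.dist(vertices[a], vertices[b])
            graph[a][b] = w
            graph[b][a] = w
    return graph


def geodesic_from(graph, source):
    """Dijkstra along mesh edges: an approximation of geodesic distance."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def filter_matches(graph, matches, max_geodesic):
    """Keep matches whose two candidate vertices are geodesically close.

    matches: (label, v_feat, v_geo) tuples, where v_feat is the vertex hit by
    the 2D feature match and v_geo the vertex suggested by the 3D descriptor.
    This acceptance rule is an assumption, not the paper's stated criterion.
    """
    cache, kept = {}, []
    for label, v_feat, v_geo in matches:
        if v_geo not in cache:
            cache[v_geo] = geodesic_from(graph, v_geo)
        if cache[v_geo][v_feat] <= max_geodesic:
            kept.append((label, v_feat, v_geo))
    return kept


# Toy U-shaped strip: the two tips are 1.0 apart in space but 3.0 apart
# along the surface, mimicking a left/right part pair.
path = [(0, 0), (0, 1), (1, 1), (1, 0)]
vertices = [(x, y, z) for (x, y) in path for z in (0, 1)]
faces = []
for i in range(len(path) - 1):
    faces.append((2 * i, 2 * i + 2, 2 * i + 1))
    faces.append((2 * i + 1, 2 * i + 2, 2 * i + 3))

graph = build_graph(vertices, faces)
matches = [
    ("left_tip", 6, 0),  # feature match landed on the opposite tip
    ("corner", 2, 2),    # both cues agree
    ("edge", 3, 2),      # cues disagree by one short edge
]
kept = filter_matches(graph, matches, max_geodesic=1.5)
```

In this toy case the `left_tip` match is dropped even though the two tips are close in Euclidean space, which is the kind of left-right confusion that a surface-aware distance can catch and a straight-line distance cannot.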
What’s different from prior work: No pose annotations. No spherical geometry shortcuts. Instance-specific 3D structure, recovered automatically.
**Results:** Improved semantic correspondence over prior post-training methods, with less manual supervision.
Code + model: github.com/GenIntel/3D-SC (coming soon)