MARCO: Navigating the Unseen Space of Semantic Correspondence
Summary
MARCO introduces a compact, fast model for semantic correspondence that achieves state-of-the-art accuracy and generalization to unseen keypoints using a coarse-to-fine objective and self-distillation framework with DINOv2.
View Cached Full Text
Cached at: 04/21/26, 07:46 PM
Paper page - MARCO: Navigating the Unseen Space of Semantic Correspondence
Source: https://huggingface.co/papers/2604.18267
Abstract
MARCO is a compact, fast model that improves semantic correspondence accuracy and generalization beyond training data by using a coarse-to-fine objective and self-distillation framework with DINOv2 and diffusion backbones.
Recent advances insemantic correspondencerely ondual-encoder architectures, combiningDINOv2withdiffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building uponDINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances bothfine-grained localizationandsemantic generalization. By coupling acoarse-to-fine objectivethat refines spatial precision with aself-distillation framework, which expandssparse supervisionbeyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify atfine-grained localizationthresholds (+8.9 [email protected]), strongest generalization to unseen keypoints (+5.1, SPair-U) and categories (+4.7, MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at https://github.com/visinf/MARCO .
View arXiv pageView PDFProject pageGitHubAdd to collection
Get this paper in your agent:
hf papers read 2604\.18267
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2604.18267 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2604.18267 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2604.18267 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction
The paper introduces Direct Corpus Interaction (DCI), a novel approach allowing AI agents to query raw text directly using standard terminal tools instead of traditional embedding-based retrieval. By bypassing fixed similarity interfaces and offline indexing, DCI significantly outperforms conventional sparse, dense, and reranking baselines across multiple IR and agentic search benchmarks.
DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition
DiZiNER is a framework that uses disagreement between multiple LLMs to refine task instructions for zero-shot named entity recognition, achieving state-of-the-art results on 14 out of 18 benchmarks and significantly reducing the performance gap between zero-shot and supervised systems.
FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models
FineSteer is a novel inference-time steering framework that decomposes steering into conditional steering and fine-grained vector synthesis stages, using Subspace-guided Conditional Steering (SCS) and Mixture-of-Steering-Experts (MoSE) mechanisms to improve safety and truthfulness while preserving model utility. Experiments show 7.6% improvement over state-of-the-art methods on TruthfulQA with minimal utility loss.
RemoteZero: Geospatial Reasoning with Zero Human Annotations
RemoteZero is a framework that eliminates the need for human-annotated box supervision in geospatial reasoning by leveraging the semantic verification capabilities of multimodal large language models (MLLMs) to enable self-evolving localization from unlabeled remote sensing data.
Meta-learning In-Context Enables Training-Free Cross Subject Brain Decoding
This paper introduces a meta-optimized approach for semantic visual decoding from fMRI signals that generalizes to novel subjects without fine-tuning, using in-context learning to infer unique neural encoding patterns from a small set of image-brain activation examples. The method achieves strong cross-subject and cross-scanner generalization without requiring anatomical alignment or stimulus overlap.