SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models
Summary
The SOCO benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions. It reveals a gap between language-grounded localization and visual correspondence, and shows that correspondence performance strongly predicts downstream task performance.
Cached at: 06/02/26, 03:35 PM
Source: https://huggingface.co/papers/2605.31597
Abstract
Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
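To make the idea of matching object parts across images concrete, here is a minimal sketch of one common way to test correspondence with dense backbone features: nearest-neighbor matching under cosine similarity. It is not taken from the paper and is not SOCO's evaluation protocol. The function names `match_keypoints` and `fraction_within`, the feature-map layout, and the pixel-radius criterion are all assumptions made for illustration.

```python
import numpy as np


def match_keypoints(src_feats, tgt_feats, src_kps):
    """Match source keypoints to target locations by cosine similarity.

    Hypothetical helper, not part of SOCO.
    src_feats, tgt_feats: dense feature maps of shape (H, W, C).
    src_kps: integer array of shape (N, 2) holding (row, col) positions.
    Returns an (N, 2) array of predicted (row, col) positions in the target.
    """
    h, w, c = tgt_feats.shape
    # L2-normalize so that a dot product equals cosine similarity.
    src = src_feats / (np.linalg.norm(src_feats, axis=-1, keepdims=True) + 1e-8)
    tgt = tgt_feats / (np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8)
    queries = src[src_kps[:, 0], src_kps[:, 1]]   # (N, C)
    sims = queries @ tgt.reshape(-1, c).T         # (N, H*W)
    best = sims.argmax(axis=1)                    # flat index of best match
    return np.stack(np.unravel_index(best, (h, w)), axis=1)


def fraction_within(pred, gt, radius):
    """Fraction of predictions within `radius` (in feature-map cells) of ground truth."""
    dists = np.linalg.norm(pred - gt, axis=1)
    return float((dists <= radius).mean())


# Toy usage: the target is the source shifted by (2, 3) with wrap-around,
# so every keypoint has an exact match at a known location.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 8, 16))
tgt = np.roll(src, shift=(2, 3), axis=(0, 1))
kps = np.array([[0, 0], [4, 5], [7, 7]])
gt = (kps + np.array([2, 3])) % 8
pred = match_keypoints(src, tgt, kps)
score = fraction_within(pred, gt, radius=1.0)
```

In a real setting the feature maps would come from a vision backbone and the threshold would typically be scaled by object size. Both choices are left open here because the abstract does not specify them.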
Get this paper in your agent:
hf papers read 2605.31597
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 1
GenIntelLab/SOCO (updated about 22 hours ago)
Similar Articles
MARCO: Navigating the Unseen Space of Semantic Correspondence
MARCO introduces a compact, fast model for semantic correspondence that achieves state-of-the-art accuracy and generalization to unseen keypoints using a coarse-to-fine objective and self-distillation framework with DINOv2.
KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models
This paper introduces KODA (Kernel Optimization for Discrepancy Analysis), a kernel-based framework for comparing and aligning vision-language model representations by identifying sample subsets that are clustered differently across models like CLIP, SigLIP, and BLIP. The method uses contrastive embedding clustering and randomized low-dimensional approximations to scale to large datasets while providing interpretable structural differences between representations.
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models
RoboSemanticBench is a benchmark that diagnoses semantic grounding in action prediction for vision-language-action models, revealing that while robots can grasp objects, they fail to select semantically correct targets based on instruction semantics.
Weakly Supervised Concept Learning for Object-centric Visual Reasoning
This paper introduces a two-stage neuro-symbolic framework that uses weak supervision (as little as 1% labels) with a slot-based VAE to learn interpretable symbols for object-centric visual reasoning, outperforming foundation models in domain generalization.
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
OSCBench is a new benchmark designed to evaluate text-to-video generation models' ability to accurately represent object state changes (transformations caused by actions like peeling or slicing). The paper reveals that current T2V models struggle with temporally consistent state changes, especially in novel and compositional scenarios, identifying this as a key bottleneck in video generation.