SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Hugging Face Daily Papers

Summary

SOCO benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions, revealing gaps between language-grounded localization and visual correspondence while demonstrating strong prediction of downstream task performance.

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
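To make the matching task concrete, here is a minimal sketch of how a semantic correspondence evaluation is commonly set up: a keypoint feature from a source image is matched to the most similar location in a target image, and the prediction is scored against the annotated keypoint. The abstract does not state which metric or matching procedure SOCO uses, so the nearest-neighbor matching, the PCK-style score (percentage of correct keypoints), the `alpha` threshold, and all function names and toy features below are illustrative assumptions rather than the benchmark's actual protocol.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def match_keypoint(query_feat, target_feats):
    """Return the (x, y) target location whose feature best matches the query.

    target_feats maps (x, y) coordinates to feature vectors (hypothetical layout).
    """
    return max(target_feats, key=lambda xy: cosine(query_feat, target_feats[xy]))

def pck(predictions, ground_truth, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * bbox_size of the annotated keypoint.

    Assumption: a PCK-style score; the source does not name SOCO's metric.
    """
    threshold = alpha * bbox_size
    hits = sum(
        1 for p, g in zip(predictions, ground_truth)
        if math.dist(p, g) <= threshold
    )
    return hits / len(ground_truth)

# Toy data: three candidate locations in the target image with made-up features.
target_feats = {
    (10, 10): [1.0, 0.0, 0.0],
    (50, 20): [0.0, 1.0, 0.0],
    (30, 60): [0.0, 0.0, 1.0],
}
# Features of two source keypoints and their annotated target locations.
query_feats = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]
ground_truth = [(12, 11), (50, 20)]

predictions = [match_keypoint(q, target_feats) for q in query_feats]
score = pck(predictions, ground_truth, bbox_size=100)
print(predictions, score)
```

In this toy case the first keypoint lands close to its annotation and the second is matched to the wrong part, so half of the keypoints count as correct. In practice the features would come from a vision backbone's dense feature map instead of hand-written vectors.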

Cached at: 06/02/26, 03:35 PM

Source: https://huggingface.co/papers/2605.31597

Datasets citing this paper: 1

GenIntelLab/SOCO

Similar Articles

MARCO: Navigating the Unseen Space of Semantic Correspondence

Hugging Face Daily Papers

MARCO introduces a compact, fast model for semantic correspondence that achieves state-of-the-art accuracy and generalization to unseen keypoints using a coarse-to-fine objective and self-distillation framework with DINOv2.

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

arXiv cs.LG

This paper introduces KODA (Kernel Optimization for Discrepancy Analysis), a kernel-based framework for comparing and aligning vision-language model representations by identifying sample subsets that are clustered differently across models like CLIP, SigLIP, and BLIP. The method uses contrastive embedding clustering and randomized low-dimensional approximations to scale to large datasets while providing interpretable structural differences between representations.

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

arXiv cs.CL

OSCBench is a new benchmark designed to evaluate text-to-video generation models' ability to accurately represent object state changes (transformations caused by actions like peeling or slicing). The paper reveals that current T2V models struggle with temporally consistent state changes, especially in novel and compositional scenarios, identifying this as a key bottleneck in video generation.