SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

Summary

The SOCO benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions. It reveals a gap between language-grounded localization and visual correspondence, and shows that correspondence performance is a strong predictor of downstream task performance.

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
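To make the matching task concrete, here is a minimal sketch of how a semantic correspondence score could be computed from per-location features. It assumes nearest-neighbor matching by cosine similarity and the PCK metric (percentage of correct keypoints), which is common in the semantic correspondence literature. The abstract does not state SOCO's exact protocol, so the metric choice, the threshold `alpha`, the feature layout, and all function names are illustrative assumptions, and the feature values are toy numbers.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def match_keypoint(query_feat, target_feats):
    """Return the (x, y) target position whose feature best matches the query.

    target_feats maps (x, y) positions to feature vectors (hypothetical layout).
    """
    return max(target_feats, key=lambda pos: cosine(query_feat, target_feats[pos]))

def pck(predictions, ground_truth, bbox_size, alpha=0.1):
    """Fraction of predictions within alpha * max(bbox side) of the ground truth."""
    threshold = alpha * max(bbox_size)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(predictions, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(ground_truth)

# Toy example: three target locations with 2-D features (made-up values).
target_feats = {(10, 10): [1.0, 0.0], (50, 40): [0.0, 1.0], (90, 80): [0.7, 0.7]}
query_feats = [[0.9, 0.1], [0.1, 0.9]]   # features at two source keypoints
ground_truth = [(12, 11), (90, 80)]      # annotated target positions

predictions = [match_keypoint(q, target_feats) for q in query_feats]
score = pck(predictions, ground_truth, bbox_size=(100, 100))
```

In this toy case the first keypoint lands within the threshold and the second does not, so `score` is 0.5. A real evaluation would use dense backbone features and annotated keypoint pairs in place of the hand-written dictionaries.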

Original Article

Cached at: 06/02/26, 03:35 PM

Source: https://huggingface.co/papers/2605.31597

Abstract

The Semantic Object Correspondence (SOCO) benchmark evaluates structured object understanding in vision models through consistent part-level annotations and keypoint descriptions. It reveals a gap between language-grounded localization and visual correspondence, and shows that correspondence performance is a strong predictor of downstream task performance.

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable a systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that introduces a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.
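Finding (iii) compares how well two benchmark scores predict downstream performance across models. The abstract does not say which statistic is used. As an illustration only, the sketch below assumes Spearman rank correlation, a common choice for this kind of comparison. All scores are synthetic placeholders for four hypothetical models and are not results from the paper.

```python
import math

def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x > 0 and sd_y > 0 else 0.0

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic placeholder scores for four hypothetical models (NOT paper results).
correspondence = [0.20, 0.40, 0.60, 0.80]
imagenet_acc = [0.80, 0.70, 0.85, 0.75]
downstream = [0.30, 0.45, 0.50, 0.70]

rho_correspondence = spearman(correspondence, downstream)
rho_imagenet = spearman(imagenet_acc, downstream)
```

With real data, the predictor whose correlation with the downstream metric is higher across a set of models would be the stronger one in the sense of finding (iii).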

Get this paper in your agent:

hf papers read 2605.31597

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

GenIntelLab/SOCO (updated about 22 hours ago)

Similar Articles

MARCO: Navigating the Unseen Space of Semantic Correspondence

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Weakly Supervised Concept Learning for Object-centric Visual Reasoning

OSCBench: Benchmarking Object State Change in Text-to-Video Generation
