From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Hugging Face Daily Papers 05/10/26, 12:00 AM Papers

Summary

Introduces CAFE, a benchmark for evaluating whether promptable segmentation models truly understand concepts by using counterfactual attribute manipulation, revealing that accurate mask prediction does not guarantee faithful semantic grounding.

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. Our CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on our CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. Our CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.

Original Article


Cached at: 05/14/26, 12:18 PM


Source: https://huggingface.co/papers/2605.09591

Abstract

CAFE is a new benchmark for evaluating concept-faithful segmentation in promptable models through attribute-level counterfactual manipulation, revealing that accurate mask prediction does not guarantee semantic grounding.

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. Our CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on our CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. Our CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
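The gap between localization quality and concept discrimination can be made concrete with a small sketch. The abstract does not state which metrics CAFE uses, so everything below is an assumption for illustration: the `PairedSample` field names, the use of plain IoU, and the `discrimination_gap` score are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List

Mask = List[List[bool]]  # row-major binary mask


@dataclass
class PairedSample:
    """One CAFE-style test item. Field names are hypothetical."""
    category: str         # "SM", "CC", or "OC"
    gt_mask: Mask         # ground-truth mask of the target region
    positive_prompt: str  # concept that truly describes the target
    negative_prompt: str  # misleading concept suggested by the edited attribute


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(p and g)
            union += int(p or g)
    # Two empty masks have no union; treat that case as zero overlap.
    return inter / union if union else 0.0


def discrimination_gap(pos_pred: Mask, neg_pred: Mask, gt: Mask) -> float:
    """IoU under the positive prompt minus IoU under the negative prompt.

    Assumed illustrative score, not the paper's metric.
    Near 1: the target is segmented only for the correct concept.
    Near 0: the same region comes back whatever the prompt says.
    """
    return iou(pos_pred, gt) - iou(neg_pred, gt)


# Toy 2x3 example: the target occupies the two right-hand columns.
gt = [[False, True, True],
      [False, True, True]]
empty = [[False, False, False],
         [False, False, False]]

# A concept-faithful model segments the target for the positive prompt
# and returns nothing for the misleading one.
faithful_gap = discrimination_gap(pos_pred=gt, neg_pred=empty, gt=gt)   # 1.0

# A shortcut-driven model returns the same accurate mask for both prompts.
shortcut_gap = discrimination_gap(pos_pred=gt, neg_pred=gt, gt=gt)      # 0.0
```

Both toy models score a perfect IoU on the positive prompt, which is why mask accuracy alone cannot separate them; only the comparison against the negative prompt does.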


Get this paper in your agent:

hf papers read 2605.09591

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

#### teemosliang/CAFE

Updated about 4 hours ago • 2.15k • 21


Similar Articles

SAM 3: Segment Anything with Concepts

PixCon: Clean-Positive Contrastive Learning for Foundation-Model Semi-Supervised Segmentation

Concept-based Visual Counterfactual Explanations with Diffusion Models

Principles of Concept Representation in Sentence Encoders

Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
