SAM 3: Segment Anything with Concepts

Papers with Code Trending Papers

Summary

SAM 3 introduces a unified model for promptable concept segmentation and tracking, achieving state-of-the-art performance with a decoupled recognition and localization architecture and a scalable data engine.

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.

Cached at: 05/20/26, 02:24 AM


Source: https://huggingface.co/papers/2511.16719

Abstract

Segment Anything Model 3 achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization.

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
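The abstract says only that a presence head decouples recognition (is the concept in the image at all?) from localization (where is each instance?). As a rough illustration of that idea, the sketch below gates per-candidate localization scores with a single image-level presence probability. The function name `gate_by_presence`, the `Detection` type, the multiplicative combination, the threshold, and the use of the candidate index as the instance identity are all assumptions made for this example; none of them is taken from the SAM 3 code or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    instance_id: int  # illustrative identity: index of the candidate that was kept
    score: float      # combined confidence after presence gating


def gate_by_presence(
    presence_prob: float,
    localization_scores: List[float],
    threshold: float = 0.5,
) -> List[Detection]:
    """Combine one image-level presence probability with per-candidate scores.

    Assumption: the two scores are multiplied. The source text does not say
    how the presence head's output is combined with localization scores.
    """
    detections = []
    for idx, loc_score in enumerate(localization_scores):
        combined = presence_prob * loc_score
        if combined >= threshold:
            detections.append(Detection(instance_id=idx, score=combined))
    return detections


# Concept likely present: well-localized candidates survive.
kept = gate_by_presence(0.9, [0.8, 0.3, 0.95])

# Concept likely absent (for example a hard negative): everything is suppressed.
suppressed = gate_by_presence(0.1, [0.8, 0.3, 0.95])
```

Under this assumed scheme, a low presence probability suppresses every candidate at once, so individual candidates do not each have to decide whether the prompted concept appears in the image.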


Get this paper in your agent:

hf papers read 2511.16719

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

