InstructSAM: Segment Any Instance with Any Instructions

Hugging Face Daily Papers 05/25/26, 12:00 AM Papers

Summary

InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3, achieving strong results across complex benchmarks.

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
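The summary does not spell out how the hybrid-attention mechanism is wired. One plausible reading, offered purely as an assumption, is that instruction and visual tokens keep the VLM's usual causal attention while the appended instance queries attend to all context tokens and to one another. The sketch below builds such a mask in plain Python; the function name `build_hybrid_mask` and the token layout (context first, queries last) are hypothetical and not taken from the paper.

```python
def build_hybrid_mask(num_context: int, num_queries: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical): the first `num_context` positions are
    instruction/visual tokens, the last `num_queries` positions are the
    learnable instance queries.
    """
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Context tokens: standard causal attention (self and earlier).
                mask[i][j] = j <= i
            else:
                # Instance queries: see every context token and every query,
                # so slots can coordinate and avoid claiming the same instance.
                mask[i][j] = True
    return mask


mask = build_hybrid_mask(num_context=3, num_queries=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumed layout, the bidirectional block among the queries is what would let one slot know that another has already claimed an instance, which is consistent with the stated goal of reducing duplicate predictions.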

Original Article

Cached at: 05/26/26, 06:42 AM

Source: https://huggingface.co/papers/2605.26102

Abstract

InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3 through learnable instance queries and hybrid attention mechanisms.

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3’s detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3’s agentic pipeline while enabling efficient single-pass multi-instance prediction.
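The abstract says the LLM-conditioned queries are projected into SAM3's detector query space but gives no details of the projector. As a rough illustration only, the sketch below assumes a single linear layer mapping each query from the VLM hidden size to the detector query size. The names `make_projection` and `project_queries`, the toy dimensions, and the initialization are all hypothetical; the actual projector may be an MLP or something else entirely.

```python
import random


def make_projection(d_in: int, d_out: int, seed: int = 0):
    """Create a (weight, bias) pair for a hypothetical linear projector."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project_queries(queries, weight, bias):
    """Map each query vector (length d_in) to the detector space (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, q)) + b for row, b in zip(weight, bias)]
        for q in queries
    ]


# Toy sizes, chosen for illustration: 2 instance queries, VLM width 4,
# detector query width 3.
weight, bias = make_projection(d_in=4, d_out=3)
llm_queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
detector_queries = project_queries(llm_queries, weight, bias)
print(len(detector_queries), len(detector_queries[0]))
```

Because only the projected queries are handed to the detector in this sketch, the detector itself needs no changes, which matches the abstract's claim that SAM3's core architecture is left unmodified.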

Similar Articles

SAM 3: Segment Anything with Concepts

SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning

@skalskip92: there's no catch; SAM3 is open source and really good one of the things it does really well is object tracking, even in…

@lillyguisnet: WEEE!!! I had not had the opportunity to try SAM3.1 yet, but simply prompting for "worm" perfectly segmented my images!…

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent
