InstructSAM: Segment Any Instance with Any Instructions
Summary
InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3, achieving strong results across complex benchmarks.
Cached at: 05/26/26, 06:42 AM
Source: https://huggingface.co/papers/2605.26102
Abstract
InstructSAM presents a unified framework for multi-instance segmentation using instruction-driven queries that bridge vision-language models and SAM3 through learnable instance queries and hybrid attention mechanisms.
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that elegantly bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3’s detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3’s agentic pipeline while enabling efficient single-pass multi-instance prediction.
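The abstract does not spell out how the hybrid-attention mechanism is wired, so the following is only a minimal sketch of one plausible reading. It assumes a sequence of visual and instruction tokens followed by the instance queries, where context tokens keep ordinary causal attention while each instance query attends to every context token and to every other query. The function name `build_hybrid_mask`, the token layout, and the masking rule are all hypothetical illustrations, not the authors' implementation.

```python
from typing import List


def build_hybrid_mask(n_context: int, n_queries: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (hypothetical, not taken from the paper): n_context
    visual + instruction tokens come first, then n_queries instance queries.
    """
    total = n_context + n_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < n_context:
                # Context tokens: standard causal attention (self and earlier only).
                mask[i][j] = j <= i
            else:
                # Instance queries: see all context tokens and all queries,
                # including later ones, so slots can coordinate with each other.
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for 3 context tokens and 2 instance queries.
    for row in build_hybrid_mask(n_context=3, n_queries=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumption every query row is fully visible, which is one way mutual awareness among queries could help reduce duplicate predictions; the actual mechanism in InstructSAM may differ.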
Get this paper in your agent:
hf papers read 2605.26102
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
SAM 3: Segment Anything with Concepts
SAM 3 introduces a unified model for promptable concept segmentation and tracking, achieving state-of-the-art performance with a decoupled recognition and localization architecture and a scalable data engine.
SAM 3.1: Faster and More Accessible Real-Time Video Detection and Tracking With Multiplexing and Global Reasoning
Meta AI releases SAM 3.1, an update to the Segment Anything Model that enhances real-time video detection and tracking through multiplexing and global reasoning capabilities.
@skalskip92: there's no catch; SAM3 is open source and really good one of the things it does really well is object tracking, even in…
SAM3 (Segment Anything Model 3) is open source and performs exceptionally well at object tracking even in complex scenes like basketball, making it a standout computer vision model.
@lillyguisnet: WEEE!!! I had not had the opportunity to try SAM3.1 yet, but simply prompting for "worm" perfectly segmented my images!…
A user shares enthusiastic feedback about SAM 3.1's ability to accurately segment images using simple text prompts like 'worm', highlighting significant improvements over SAM 1.
SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent
This paper proposes SAM, a state-adaptive memory framework that dynamically manages interaction histories for long-horizon agentic reasoning, enabling intent-driven recall without retraining the backbone model. It outperforms strong baselines across multiple benchmarks like BrowseComp and HLE.