InstructSAM: Segment Any Instance with Any Instruction

Abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a large-scale, high-quality instruction-based instance segmentation dataset and benchmark that pairs free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
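
The abstract does not spell out how the hybrid-attention mechanism is wired, so the following is only an illustrative sketch under stated assumptions. It assumes a token layout of [visual tokens | instruction tokens | instance queries], ordinary causal attention over the visual and instruction tokens, and full bidirectional attention for the query rows (each query sees every visual token, every instruction token, and every other query). The function names `build_hybrid_mask` and `render` are hypothetical and are not taken from the paper or from any library.

```python
# Illustrative sketch only: the token layout and masking rule are assumptions,
# not the confirmed InstructSAM design.

def build_hybrid_mask(num_visual, num_text, num_queries):
    """Return a square boolean mask as a list of lists.

    mask[i][j] is True when token i is allowed to attend to token j.
    Assumed layout: [visual tokens | instruction tokens | instance queries].
    """
    prefix = num_visual + num_text
    total = prefix + num_queries
    mask = []
    for i in range(total):
        if i < prefix:
            # Visual/instruction tokens: standard causal attention (j <= i).
            row = [j <= i for j in range(total)]
        else:
            # Instance queries: attend to the whole prefix and to every
            # other query, so slots can coordinate with each other.
            row = [True] * total
        mask.append(row)
    return mask


def render(mask):
    """Draw the mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    # 2 visual tokens, 2 instruction tokens, 3 instance queries.
    print(render(build_hybrid_mask(2, 2, 3)))
```

Under these assumptions, letting the query rows see one another is what would allow the slots to divide the instances among themselves, which is one plausible reading of how duplicate predictions could be reduced. The actual mechanism may differ.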