InstructSAM: Segmenting Any Instance with Any Instruction

Abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, so that each query serves as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, the visual tokens, and the instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that pairs free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only the 2B-parameter scale, achieves strong results on both complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
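
The abstract does not spell out the hybrid-attention pattern or the projection into the detector query space, so the following is only a minimal sketch of one plausible reading. It assumes that context tokens (visual and instruction) keep causal attention, that instance queries are appended after them and attend to all context tokens and to each other, and that a single linear map performs the projection. The function names, token layout, and dimensions are hypothetical and are not taken from InstructSAM or SAM3.

```python
import numpy as np


def hybrid_attention_mask(num_context: int, num_queries: int) -> np.ndarray:
    """Return a boolean mask where True means "row token may attend to column token".

    Assumed layout (hypothetical): [context tokens..., instance queries...].
    Context tokens attend causally; instance queries see every context token
    and every other query (bidirectional among queries).
    """
    total = num_context + num_queries
    # Causal baseline: token i attends to tokens 0..i.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Open up the query-to-query block so queries can coordinate with each other.
    mask[num_context:, num_context:] = True
    return mask


def project_queries(hidden: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linearly map LLM-conditioned query states to an assumed detector query width."""
    return hidden @ weight + bias


# Toy sizes chosen for illustration only.
mask = hybrid_attention_mask(num_context=4, num_queries=3)

rng = np.random.default_rng(0)
llm_hidden = rng.standard_normal((3, 8))  # 3 queries, hypothetical LLM width 8
weight = rng.standard_normal((8, 5))      # hypothetical detector width 5
bias = np.zeros(5)
detector_queries = project_queries(llm_hidden, weight, bias)
```

Letting the queries see one another is what would allow them to split the instances among themselves, which is consistent with the stated goal of reducing duplicate predictions. The actual model may use a different masking scheme or a deeper projection head.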