InstructSAM: Segment Any Instance with Any Instructions
May 25, 2026
Authors: Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li, Siliang Tang, Jun Xiao, Yueting Zhuang, Wenqiao Zhang
cs.AI
Abstract
In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results on both complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
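The abstract does not specify how the hybrid-attention mechanism is wired, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a token layout of visual tokens, then instruction tokens, then instance queries; it further assumes that visual and instruction tokens keep ordinary causal attention while instance queries attend to the whole prefix and to each other bidirectionally. The function name `build_hybrid_attention_mask` and its parameters are hypothetical.

```python
def build_hybrid_attention_mask(n_visual, n_text, n_queries):
    """Return a square boolean mask; mask[i][j] is True when position i may attend to j.

    Assumed (hypothetical) layout: [visual tokens | instruction tokens | instance queries].
    This layout and the masking rule are illustrative assumptions, not taken from the paper.
    """
    n_prefix = n_visual + n_text
    total = n_prefix + n_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < n_prefix:
                # Visual and instruction tokens: standard causal attention.
                mask[i][j] = j <= i
            else:
                # Instance queries: see the full prefix and every other query,
                # so slots can coordinate and avoid claiming the same instance.
                mask[i][j] = True
    return mask


# Small example: 2 visual tokens, 2 instruction tokens, 3 instance queries.
mask = build_hybrid_attention_mask(n_visual=2, n_text=2, n_queries=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, letting the queries see one another is what would allow them to behave as a set of slots rather than independent predictions; the real model may use a different layout or masking rule.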