InstructSAM: Segmenting Any Instance under Any Instruction

Abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that pairs free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results on both complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
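The abstract does not spell out how the hybrid-attention mechanism is wired, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes a sequence laid out as context tokens (visual and instruction) followed by the instance queries, with causal attention over the context and full bidirectional attention among the queries. The function name `hybrid_attention_mask` and its parameters are hypothetical.

```python
def hybrid_attention_mask(num_context, num_queries):
    """Build a boolean attention mask; mask[i][j] is True if position i may attend to j.

    Assumed layout (hypothetical): [context tokens ..., instance queries ...].
    Context tokens use standard causal attention. Instance queries see all
    context tokens and also see each other in both directions.
    """
    total = num_context + num_queries
    # Start from a causal (lower-triangular) mask over the whole sequence.
    mask = [[j <= i for j in range(total)] for i in range(total)]
    # Open up the query-to-query block so queries can attend to one another freely.
    for i in range(num_context, total):
        for j in range(num_context, total):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    mask = hybrid_attention_mask(num_context=4, num_queries=3)
    for row in mask:
        print("".join("1" if allowed else "0" for allowed in row))
```

Under this assumption, letting the queries attend to each other is what would allow them to coordinate which instance each slot claims, which is consistent with the stated goal of reducing duplicate predictions.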