InstructSAM: Segmenting Any Instance with Any Instruction

Abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, so that each query serves as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, the visual tokens, and the instruction tokens, which improves instance enumeration and reduces duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a large-scale, high-quality instruction-based instance segmentation dataset and benchmark that pairs free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only the 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
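
The query interface described above can be illustrated with a minimal sketch. The abstract does not specify the attention pattern or the projection, so everything here is an assumption made for illustration: the token layout (visual and instruction tokens first, instance queries last), causal attention over the prefix, bidirectional attention among the queries, a plain linear projection, and the hypothetical names `hybrid_attention_mask` and `project_queries`. It is not the authors' implementation and does not use any SAM3 API.

```python
import random


def hybrid_attention_mask(num_prefix, num_queries):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout: [visual + instruction tokens | instance queries].
    Assumption: prefix tokens attend causally, while instance queries
    attend to the whole prefix and to each other in both directions.
    """
    total = num_prefix + num_queries
    # Causal baseline: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(total)] for i in range(total)]
    # Open up the query block so the queries can coordinate with each other.
    for i in range(num_prefix, total):
        for j in range(num_prefix, total):
            mask[i][j] = True
    return mask


def project_queries(query_states, weight, bias):
    """Map each query state from the VLM width to a (hypothetical) detector width."""
    return [
        [sum(q[k] * weight[k][d] for k in range(len(q))) + bias[d]
         for d in range(len(bias))]
        for q in query_states
    ]


# Toy sizes, chosen only for illustration.
random.seed(0)
num_prefix, num_queries, vlm_dim, det_dim = 6, 3, 16, 8

mask = hybrid_attention_mask(num_prefix, num_queries)
query_states = [[random.gauss(0, 1) for _ in range(vlm_dim)]
                for _ in range(num_queries)]
weight = [[random.gauss(0, 1) for _ in range(det_dim)] for _ in range(vlm_dim)]
bias = [0.0] * det_dim
detector_queries = project_queries(query_states, weight, bias)

for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
print(len(detector_queries), len(detector_queries[0]))
```

In this sketch, letting the queries see one another is one plausible way to reduce duplicate predictions, since each slot can condition on what the others encode. Keeping the prefix causal leaves the language model's usual decoding behavior unchanged.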