InstructSAM: Segmenting Any Instance Under Any Instruction

Abstract

In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality, large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only the 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
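
The abstract does not specify how the hybrid-attention mechanism is laid out, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a token order of visual tokens, then instruction tokens, then instance queries; it assumes the visual and instruction tokens keep ordinary causal attention while the instance queries attend to all context tokens and to each other bidirectionally. The function name `build_hybrid_mask` and its parameters are hypothetical.

```python
from typing import List


def build_hybrid_mask(num_visual: int, num_text: int, num_queries: int) -> List[List[bool]]:
    """Return mask[i][j], True when token i may attend to token j.

    Assumed (hypothetical) layout: [visual tokens | instruction tokens | instance queries].
    """
    num_context = num_visual + num_text
    total = num_context + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_context:
                # Assumption: context tokens use standard causal attention,
                # so they never see later tokens, including the queries.
                mask[i][j] = j <= i
            else:
                # Assumption: each instance query sees every context token
                # and every other query, so queries can coordinate.
                mask[i][j] = True
    return mask


# Example: 2 visual tokens, 2 instruction tokens, 3 instance queries.
mask = build_hybrid_mask(num_visual=2, num_text=2, num_queries=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, letting queries see one another is what would allow each slot to account for instances already claimed by other slots, which is consistent with the stated goal of reducing duplicate predictions.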