Segment and Caption Anything

Abstract

We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short in semantic understanding. By introducing a lightweight query-based feature mixer, we align region-specific features with the embedding space of language models for subsequent caption generation. Because the number of trainable parameters is small (typically on the order of tens of millions), training requires less computation, less memory, and less communication bandwidth, making it both fast and scalable. To address the scarcity of regional caption data, we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining, since the pre-training data contains only category names instead of full-sentence descriptions. Weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on efficient ways to augment SAM with regional semantics. The project page, along with the associated code, is available at https://xk-huang.github.io/segment-caption-anything/.
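The weak supervision idea can be illustrated with a minimal sketch: each annotated region in a detection dataset is paired with its category name, which then serves as a very short caption target. The snippet below assumes COCO-style annotations (`categories` with `id` and `name`, `annotations` with `image_id`, `bbox`, and `category_id`). The function name `build_weak_caption_pairs` and the toy data are hypothetical and are not taken from the project code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), COCO-style


def build_weak_caption_pairs(coco: Dict) -> List[Tuple[int, Box, str]]:
    """Turn detection annotations into (image_id, box, text) training pairs.

    The category name stands in for the caption; this is an illustrative
    assumption about how category names can act as weak text supervision.
    """
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    pairs = []
    for ann in coco["annotations"]:
        name = id_to_name.get(ann["category_id"])
        if name is None:  # skip annotations whose category is undefined
            continue
        pairs.append((ann["image_id"], tuple(ann["bbox"]), name))
    return pairs


# Toy, made-up annotations used only to exercise the function.
toy = {
    "categories": [{"id": 1, "name": "dog"}, {"id": 2, "name": "traffic light"}],
    "annotations": [
        {"image_id": 7, "bbox": [10, 20, 30, 40], "category_id": 1},
        {"image_id": 7, "bbox": [0, 0, 5, 15], "category_id": 2},
        {"image_id": 8, "bbox": [1, 2, 3, 4], "category_id": 99},
    ],
}
pairs = build_weak_caption_pairs(toy)
print(pairs)
```

In such a setup, the region box would be given to the model as a prompt and the category name would be the text to generate, so any public detection or segmentation dataset in this format could provide pre-training pairs.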
