Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Abstract

We present Set-of-Mark (SoM), a new visual prompting method that unleashes the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM, in a zero-shot setting, outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg.
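As a rough illustration of the mark-overlay step described above, the sketch below assumes that region masks have already been produced by a segmentation model such as SAM and are given as binary grids. It assigns a numeric mark to each region and picks a pixel inside the region where the mark could be drawn. The mask format, the centroid-based placement rule, and the names `mark_position` and `set_of_marks` are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

# Assumed mask format: a binary grid where 1 means the pixel belongs to the region.
Mask = List[List[int]]


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) of the region pixel closest to the region's centroid.

    Choosing a pixel that belongs to the region keeps the mark inside it,
    even when the region is not convex. This rule is an assumption made
    for illustration only.
    """
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no region pixels")
    centroid_r = sum(r for r, _ in pixels) / len(pixels)
    centroid_c = sum(c for _, c in pixels) / len(pixels)
    return min(
        pixels,
        key=lambda p: (p[0] - centroid_r) ** 2 + (p[1] - centroid_c) ** 2,
    )


def set_of_marks(masks: List[Mask]) -> List[Tuple[str, Tuple[int, int]]]:
    """Pair each region, in input order, with a numeric mark "1", "2", ... and a position."""
    return [(str(i), mark_position(m)) for i, m in enumerate(masks, start=1)]


if __name__ == "__main__":
    # Two toy regions on a 3x4 grid (hypothetical data).
    region_a = [[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 0]]
    region_b = [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]]
    for label, (row, col) in set_of_marks([region_a, region_b]):
        print(f"mark {label} at row={row}, col={col}")
```

In a full pipeline, the labels would then be drawn onto the image at these positions, and the marked image would be sent to the LMM together with a question that can be answered by citing mark numbers.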
