Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
October 17, 2023
Authors: Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, Jianfeng Gao
cs.AI
Abstract
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the
visual grounding abilities of large multimodal models (LMMs), such as GPT-4V.
As illustrated in Fig. 1 (right), we employ off-the-shelf interactive
segmentation models, such as SAM, to partition an image into regions at
different levels of granularity, and overlay these regions with a set of marks,
e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V
can answer questions that require visual grounding. We perform a comprehensive
empirical study to validate the effectiveness of SoM on a wide range of
fine-grained vision and multimodal tasks. For example, our experiments show
that GPT-4V with SoM outperforms the state-of-the-art fully-finetuned referring
segmentation model on RefCOCOg in a zero-shot setting.
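The sketch below is a minimal, illustrative rendering of the pipeline the abstract describes: an off-the-shelf segmentation model (SAM) partitions the image into regions, numeric marks are overlaid at each region's centroid, and the marked image is sent to a GPT-4V-capable model together with a grounding question. The image path, SAM checkpoint path, chat model name, and question text are assumptions for demonstration; the paper's full marking scheme also includes masks and boxes rather than only numeric labels.

```python
import base64

import cv2
import numpy as np
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# 1. Partition the image into regions with an off-the-shelf segmentation model (SAM).
image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # checkpoint path is illustrative
masks = SamAutomaticMaskGenerator(sam).generate(image)

# 2. Overlay a set of marks: here, one numeric label at the centroid of each region.
marked = image.copy()
for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True), start=1):
    ys, xs = np.nonzero(m["segmentation"])          # pixels belonging to this region
    cx, cy = int(xs.mean()), int(ys.mean())          # rough centroid for label placement
    cv2.putText(marked, str(i), (cx, cy), cv2.FONT_HERSHEY_SIMPLEX,
                0.8, (255, 255, 255), 2, cv2.LINE_AA)
cv2.imwrite("marked.png", cv2.cvtColor(marked, cv2.COLOR_RGB2BGR))

# 3. Feed the marked image to GPT-4V and ask a question that requires visual grounding.
with open("marked.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # any GPT-4V-capable chat model; the exact name is an assumption
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The image has numeric marks on its regions. "
                     "Which mark covers the dog, and what is it doing?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Because the marks give the model compact, referable handles for each region, the answer can name a specific mark instead of a free-form description, which is what enables the grounded evaluation (e.g., referring segmentation on RefCOCOg) mentioned above.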