Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Abstract

We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SAM, to partition an image into regions at different levels of granularity, and overlay these regions with a set of marks, e.g., alphanumerics, masks, and boxes. Using the marked image as input, GPT-4V can answer questions that require visual grounding. We perform a comprehensive empirical study to validate the effectiveness of SoM on a wide range of fine-grained vision and multimodal tasks. For example, our experiments show that GPT-4V with SoM in a zero-shot setting outperforms the state-of-the-art fully-finetuned referring segmentation model on RefCOCOg.
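The marking step described above can be illustrated with a minimal sketch. It assumes the segmentation model has already produced one boolean mask per region, and it only decides which numeric label each region gets and where the label would be drawn. The names `Mask`, `mark_position`, and `assign_marks`, the nested-list mask format, and the centroid-snapping placement rule are all hypothetical choices made for illustration; the abstract does not specify how marks are positioned.

```python
from typing import List, Tuple

# Hypothetical format: one boolean grid per region, as a segmentation
# model such as SAM might produce after conversion to plain lists.
Mask = List[List[bool]]


def mark_position(mask: Mask) -> Tuple[int, int]:
    """Pick a (row, col) pixel inside the region at which to draw its mark."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, on in enumerate(row) if on]
    if not pixels:
        raise ValueError("empty mask")
    # Centroid of the region; for concave regions it may fall outside the mask.
    cr = sum(r for r, _ in pixels) / len(pixels)
    cc = sum(c for _, c in pixels) / len(pixels)
    # Snap to the region pixel closest to the centroid so the mark stays inside.
    return min(pixels, key=lambda p: (p[0] - cr) ** 2 + (p[1] - cc) ** 2)


def assign_marks(masks: List[Mask]) -> List[Tuple[str, Tuple[int, int]]]:
    """Give each region a numeric label (starting at 1) and a draw position."""
    return [(str(i), mark_position(m)) for i, m in enumerate(masks, start=1)]


# Two toy 3x4 regions standing in for segmentation output.
left = [[True, True, False, False] for _ in range(3)]
right = [[False, False, False, True] for _ in range(3)]
marks = assign_marks([left, right])
```

In a full pipeline, each label would then be rendered onto the image at its position, and the marked image would be sent to the model together with a question that refers to regions by their labels.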
