Attention Prompting on Image for Large Vision-Language Models

Abstract

Compared with Large Language Models (LLMs), Large Vision-Language Models (LVLMs) can also accept images as input, and therefore show more interesting emergent capabilities and impressive performance on various vision-language tasks. Motivated by text prompting in LLMs, visual prompting has been explored as a way to enhance LVLMs' ability to perceive visual information. However, previous visual prompting techniques process only the visual input and do not consider the text query, which limits the models' ability to follow text instructions to complete tasks. To fill this gap, we propose a new prompting technique named Attention Prompting on Image, which simply overlays a text-query-guided attention heatmap on the original input image and effectively enhances LVLMs on various tasks. Specifically, we use an auxiliary model such as CLIP to generate an attention heatmap for the input image that depends on the text query. The heatmap is then multiplied with the pixel values of the original image to obtain the actual input image for the LVLM. Extensive experiments on various vision-language benchmarks verify the effectiveness of our technique. For example, Attention Prompting on Image improves LLaVA-1.5 by 3.8% and 2.9% on the MM-Vet and LLaVA-Wild benchmarks, respectively.
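The two steps described in the abstract (a query-dependent heatmap, then a pixel-wise multiplication) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the heatmap is derived from the auxiliary model, so the sketch assumes that patch embeddings and a text embedding are already available from a CLIP-like encoder and scores each patch by cosine similarity. The function names `query_heatmap` and `apply_heatmap`, the min-max scaling, and the nearest-neighbour upsampling are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def query_heatmap(patch_embs, text_emb):
    """Score each patch against the text query and min-max scale to [0, 1].

    patch_embs: rows x cols grid of patch embedding vectors, assumed to come
    from a CLIP-like image encoder. text_emb: embedding of the text query.
    Cosine scoring is an assumption, not the paper's stated procedure.
    """
    scores = [[cosine(p, text_emb) for p in row] for row in patch_embs]
    flat = [s for row in scores for s in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # All patches score the same: use a flat heatmap that keeps the image.
        return [[1.0 for _ in row] for row in scores]
    return [[(s - lo) / span for s in row] for row in scores]


def apply_heatmap(image, heatmap):
    """Multiply every pixel by the heatmap value of the patch covering it.

    image: H x W grid of pixels, each pixel a list of channel values.
    The coarse heatmap is upsampled by nearest-neighbour lookup.
    """
    height, width = len(image), len(image[0])
    rows, cols = len(heatmap), len(heatmap[0])
    out = []
    for y in range(height):
        r = y * rows // height
        out_row = []
        for x in range(width):
            c = x * cols // width
            out_row.append([ch * heatmap[r][c] for ch in image[y][x]])
        out.append(out_row)
    return out


# Toy inputs: a 2 x 2 grid of 2-d patch embeddings and a 4 x 4 grey image.
patch_embs = [[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 1.0], [-1.0, 0.0]]]
text_emb = [1.0, 0.0]
image = [[[100, 100, 100] for _ in range(4)] for _ in range(4)]

heatmap = query_heatmap(patch_embs, text_emb)
prompted = apply_heatmap(image, heatmap)  # this would be the LVLM's input
```

In this toy case the top-left patch matches the query best and keeps its full brightness, while the bottom-right patch, whose embedding points away from the query, is darkened the most. A real pipeline would replace the hand-written embeddings with outputs of the auxiliary model and operate on image arrays rather than nested lists.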
