

VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

June 2, 2023
Authors: Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan
cs.AI

Abstract

Vision-language models (VLMs) discriminatively pre-trained with contrastive image-text matching losses such as P(match|text, image) have been criticized for lacking compositional understanding. This means they might output similar scores even if the original caption is rearranged into a different semantic statement. To address this, we propose to use the Visual Generative Pre-Training Score (VisualGPTScore) of P(text|image), a multimodal generative score that captures the likelihood of a text caption conditioned on an image using an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore demonstrates top-tier performance on recently proposed image-text retrieval benchmarks like ARO and Crepe that assess compositional reasoning. Furthermore, we factorize VisualGPTScore into a product of the marginal P(text) and the Pointwise Mutual Information (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks like Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
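The scoring and factorization described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: in practice the per-token log-probabilities would come from an image-conditioned captioning model (for the conditional score) and an unconditional language model (for the marginal); here they are hard-coded toy values, purely to show the arithmetic behind VisualGPTScore = log P(text|image) and PMI = log P(text|image) - log P(text).

```python
def visual_gpt_score(cond_token_logprobs):
    """VisualGPTScore: log P(text | image), the sum of per-token
    log-probabilities from an image-conditioned language model."""
    return sum(cond_token_logprobs)


def pmi(cond_token_logprobs, marg_token_logprobs):
    """Pointwise Mutual Information between text and image:
    log P(text | image) - log P(text).
    Subtracting the marginal removes pure language bias, since
    log P(text | image) = log P(text) + PMI(text; image)."""
    return visual_gpt_score(cond_token_logprobs) - sum(marg_token_logprobs)


# Toy example: two candidate captions for one image, with hypothetical
# per-token log-probs (conditional on the image, and marginal).
captions = {
    "a dog chasing a ball": (
        [-0.2, -0.4, -0.3, -0.1, -0.5],   # log P(token | image, prefix)
        [-1.0, -1.2, -0.9, -0.8, -1.1],   # log P(token | prefix)
    ),
    "a ball chasing a dog": (
        [-0.9, -1.5, -1.1, -0.7, -1.3],
        [-1.0, -1.3, -0.9, -0.8, -1.2],
    ),
}

for text, (cond, marg) in captions.items():
    print(f"{text!r}: score={visual_gpt_score(cond):.2f}, "
          f"pmi={pmi(cond, marg):.2f}")
```

Under this toy setup, the word-swapped caption receives a lower conditional score, which is exactly the behavior a bag-of-words matching score would fail to exhibit; ranking by PMI instead of the raw score is what the paper's debiasing step amounts to.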