VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

June 2, 2023
Authors: Zhiqiu Lin, Xinyue Chen, Deepak Pathak, Pengchuan Zhang, Deva Ramanan
cs.AI

Abstract

Vision-language models (VLMs) discriminatively pre-trained with contrastive image-text matching losses such as P(match|text, image) have been criticized for lacking compositional understanding. This means they might output similar scores even if the original caption is rearranged into a different semantic statement. To address this, we propose to use the Visual Generative Pre-Training Score (VisualGPTScore) of P(text|image), a multimodal generative score that captures the likelihood of a text caption conditioned on an image using an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore demonstrates top-tier performance on recently proposed image-text retrieval benchmarks like ARO and Crepe that assess compositional reasoning. Furthermore, we factorize VisualGPTScore into a product of the marginal P(text) and the Pointwise Mutual Information (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks like Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
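The generative score described above is simply the (log-)likelihood of a caption under an image-conditioned language model, and the factorization in the abstract follows the standard identity log P(text|image) = log P(text) + PMI(text, image). The sketch below illustrates one way such scores could be computed off the shelf; it is a minimal sketch under our own assumptions, using Hugging Face's BLIP captioning model as the image-conditioned LM and GPT-2 as a stand-in estimate of the marginal P(text), neither of which is necessarily the paper's exact setup.

```python
# Minimal sketch: a generative VLM score log P(text | image) from an
# image-conditioned LM, factored against a marginal log P(text).
# BLIP and GPT-2 are stand-ins chosen for availability, not the paper's models.
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    GPT2TokenizerFast, GPT2LMHeadModel,
)

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).eval()
gpt2_tok = GPT2TokenizerFast.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()


@torch.no_grad()
def log_p_text_given_image(image: Image.Image, caption: str) -> float:
    """Approximate summed log-prob of the caption under the image-conditioned LM."""
    inputs = blip_processor(images=image, text=caption, return_tensors="pt")
    out = blip(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean per-token cross-entropy; scale by length to
    # approximate the summed caption log-probability.
    return -(out.loss.item() * inputs["input_ids"].shape[1])


@torch.no_grad()
def log_p_text(caption: str) -> float:
    """Approximate summed log-prob of the caption under a text-only LM (marginal estimate)."""
    ids = gpt2_tok(caption, return_tensors="pt")["input_ids"]
    out = gpt2(ids, labels=ids)
    return -(out.loss.item() * max(ids.shape[1] - 1, 1))


def scores(image: Image.Image, caption: str) -> dict:
    cond = log_p_text_given_image(image, caption)  # log P(text | image)
    marg = log_p_text(caption)                     # log P(text)
    return {"log_p_t_given_i": cond, "log_p_t": marg, "pmi": cond - marg}


# Usage: rank candidate captions for one image by the generative score, or by
# PMI to discount captions that are merely likely a priori (language bias).
# image = Image.open("example.jpg").convert("RGB")
# print(scores(image, "a dog chasing a ball"))
```

Ranking by the raw conditional score corresponds to the off-the-shelf VisualGPTScore setting, while the PMI term separates image-specific evidence from the caption's prior likelihood, which is the lever the abstract describes for diagnosing and debiasing language-biased benchmarks.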