VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

Abstract

Vision-language models (VLMs) that are discriminatively pre-trained with contrastive image-text matching losses such as P(match|text, image) have been criticized for lacking compositional understanding: they may output similar scores even when the original caption is rearranged into a different semantic statement. To address this, we propose using the Visual Generative Pre-Training Score (VisualGPTScore) of P(text|image), a multimodal generative score that captures the likelihood of a text caption conditioned on an image, computed with an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore demonstrates top-tier performance on recently proposed image-text retrieval benchmarks such as ARO and Crepe, which assess compositional reasoning. Furthermore, we factorize VisualGPTScore into a product of the marginal P(text) and the Pointwise Mutual Information (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks such as Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
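
The factorization mentioned in the abstract follows directly from the definition of conditional probability. As a sketch only, writing t for the text and i for the image (notation assumed here, not taken from the source) and reading "PMI" in its ratio form rather than the more common log form:

```latex
P(t \mid i) \;=\; \frac{P(t, i)}{P(i)}
            \;=\; P(t) \cdot \frac{P(t, i)}{P(t)\,P(i)}
            \;=\; P(t) \cdot \exp\!\big(\mathrm{PMI}(t, i)\big),
\qquad
\mathrm{PMI}(t, i) \;=\; \log \frac{P(t, i)}{P(t)\,P(i)}
```

Under this reading, a caption can receive a high score either because it is likely as text on its own, through P(t), or because it is genuinely informative about the image, through the PMI term. The first case corresponds to what the abstract calls language bias.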
