VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores

Abstract

Vision-language models (VLMs) discriminatively pre-trained with contrastive image-text matching losses such as P(match|text, image) have been criticized for lacking compositional understanding. That is, they may assign similar scores even when the original caption is rearranged into a semantically different statement. To address this, we propose to use the Visual Generative Pre-Training Score (VisualGPTScore) of P(text|image), a multimodal generative score that captures the likelihood of a text caption conditioned on an image using an image-conditioned language model. Contrary to the belief that VLMs are mere bag-of-words models, our off-the-shelf VisualGPTScore demonstrates top-tier performance on recently proposed image-text retrieval benchmarks that assess compositional reasoning, such as ARO and Crepe. Furthermore, we factorize VisualGPTScore into a product of the marginal P(text) and the Pointwise Mutual Information (PMI). This helps to (a) diagnose datasets with strong language bias, and (b) debias results on other benchmarks like Winoground using an information-theoretic framework. VisualGPTScore provides valuable insights and serves as a strong baseline for future evaluation of visio-linguistic compositionality.
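The factorization described in the abstract can be read in log space as log P(text|image) = log P(text) + PMI(text, image), so subtracting the language-only term from the generative score leaves the part that actually depends on the image. The following is a minimal sketch of that idea on made-up numbers, not the authors' implementation: the per-token log-probabilities, the captions, the helper names, and the weighting parameter `alpha` are all hypothetical and chosen only for illustration. In practice the conditional values would come from an image-conditioned language model and the marginal values from some estimate of P(text).

```python
def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities into a sequence log-likelihood."""
    return sum(token_log_probs)


def debiased_score(log_p_text_given_image, log_p_text, alpha=1.0):
    """Return log P(text|image) - alpha * log P(text).

    alpha is a hypothetical knob for this sketch:
    alpha = 0 gives the raw generative score,
    alpha = 1 gives the PMI term of the factorization.
    """
    return log_p_text_given_image - alpha * log_p_text


def best_caption(candidates, alpha):
    """Pick the caption with the highest (optionally debiased) score."""
    def score(caption):
        cond = sequence_log_prob(candidates[caption]["cond"])
        marg = sequence_log_prob(candidates[caption]["marg"])
        return debiased_score(cond, marg, alpha)
    return max(candidates, key=score)


# Made-up per-token log-probabilities for one hypothetical image that
# shows an unusual scene. "cond" stands for log P(token | image, prefix),
# "marg" for log P(token | prefix) without the image.
candidates = {
    # Matches the image, but is an unlikely sentence on its own.
    "a dog riding a person": {
        "cond": [-1.0, -2.0, -1.5, -1.5],
        "marg": [-2.0, -4.0, -3.0, -3.0],
    },
    # Does not match the image, but is a very likely sentence.
    "a person riding a dog": {
        "cond": [-1.0, -1.5, -1.0, -1.5],
        "marg": [-1.0, -1.5, -1.5, -1.5],
    },
}

raw_choice = best_caption(candidates, alpha=0.0)
pmi_choice = best_caption(candidates, alpha=1.0)
print("raw generative score picks:", raw_choice)
print("PMI (language prior removed) picks:", pmi_choice)
```

In this toy setup the raw score prefers the linguistically likely caption, while removing the language prior flips the choice to the caption that the image actually supports. This is the kind of language bias the factorization is meant to expose.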
