What You See is What You Read? Improving Text-Image Alignment Evaluation

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements of whether a given text-image pair is semantically aligned. We then describe two automatic methods for determining alignment: the first is a pipeline based on question generation and visual question answering models, and the second is an end-to-end classification approach that fine-tunes multimodal pretrained models. Both methods surpass prior approaches on various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
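The question-generation plus visual-question-answering pipeline, the localization of misalignments, and the re-ranking use case can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the function names, the callable interfaces, and the choice to average per-question scores are hypothetical, and the toy stand-ins replace real question generation and VQA models so that the example runs on its own.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

QAPair = Tuple[str, str]                      # (question, expected answer)
GenerateQA = Callable[[str], List[QAPair]]    # hypothetical question generator
VqaScore = Callable[[Any, str, str], float]   # hypothetical VQA scorer in [0, 1]


def alignment_score(text: str, image: Any,
                    generate_qa: GenerateQA, vqa_score: VqaScore) -> float:
    """Average VQA support over questions derived from the text.

    Averaging is an assumption of this sketch; the abstract does not say
    how per-question scores are combined.
    """
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        return 0.0
    return sum(vqa_score(image, q, a) for q, a in qa_pairs) / len(qa_pairs)


def weakest_question(text: str, image: Any,
                     generate_qa: GenerateQA,
                     vqa_score: VqaScore) -> Optional[QAPair]:
    """Return the question-answer pair the image supports least.

    This is one simple way to point at a likely misalignment.
    """
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        return None
    return min(qa_pairs, key=lambda qa: vqa_score(image, qa[0], qa[1]))


def rerank(text: str, images: Sequence[Any],
           generate_qa: GenerateQA, vqa_score: VqaScore) -> List[Any]:
    """Order candidate images from most to least aligned with the text."""
    return sorted(
        images,
        key=lambda img: alignment_score(text, img, generate_qa, vqa_score),
        reverse=True,
    )


# Toy stand-ins (not real models): an "image" is a set of phrases it depicts.
def toy_generate_qa(text: str) -> List[QAPair]:
    return [(f"Is there {part.strip()}?", "yes") for part in text.split(" and ")]


def toy_vqa_score(image: Any, question: str, answer: str) -> float:
    phrase = question[len("Is there "):-1]
    return 1.0 if (phrase in image) == (answer == "yes") else 0.0


text = "a red cube and a blue ball"
good = frozenset({"a red cube", "a blue ball"})
bad = frozenset({"a red cube"})
ranked = rerank(text, [bad, good], toy_generate_qa, toy_vqa_score)
missing = weakest_question(text, bad, toy_generate_qa, toy_vqa_score)
```

In this toy run, the image depicting both objects is ranked first, and the least-supported question for the other image concerns the missing blue ball. In practice the two callables would wrap real question generation and VQA models.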
