What You See is What You Read? Improving Text-Image Alignment Evaluation
May 17, 2023
Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor
cs.AI
Abstract
Automatically determining whether a text and a corresponding image are
semantically aligned is a significant challenge for vision-language models,
with applications in generative text-to-image and image-to-text tasks. In this
work, we study methods for automatic text-image alignment evaluation. We first
introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets
from both text-to-image and image-to-text generation tasks, with human
judgements for whether a given text-image pair is semantically aligned. We then
describe two automatic methods to determine alignment: the first involving a
pipeline based on question generation and visual question answering models, and
the second employing an end-to-end classification approach by finetuning
multimodal pretrained models. Both methods surpass prior approaches in various
text-image alignment tasks, with significant improvements in challenging cases
that involve complex composition or unnatural images. Finally, we demonstrate
how our approaches can localize specific misalignments between an image and a
given text, and how they can be used to automatically re-rank candidates in
text-to-image generation.
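
As a rough illustration of the first method (question generation followed by visual question answering) and of the re-ranking use case, here is a minimal Python sketch. It is not the authors' implementation: the function names, the `generate_qa` and `vqa_score` interfaces, the mean aggregation, and the 0.5 threshold are all assumptions made for this example, and the "toy" stand-ins at the end replace real models with simple tag lookups so the sketch runs on its own.

```python
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# - a question generator maps a text to (question, expected_answer) pairs
# - a VQA scorer returns a value in [0, 1] indicating how strongly the
#   image supports the expected answer to the question
QAPair = Tuple[str, str]
QuestionGenerator = Callable[[str], List[QAPair]]
VQAScorer = Callable[[object, str, str], float]


def alignment_score(image, text: str,
                    generate_qa: QuestionGenerator,
                    vqa_score: VQAScorer) -> float:
    """Average the per-question VQA scores (mean aggregation is an assumption)."""
    qa_pairs = generate_qa(text)
    if not qa_pairs:
        return 0.0
    return mean(vqa_score(image, q, a) for q, a in qa_pairs)


def localize_misalignments(image, text: str,
                           generate_qa: QuestionGenerator,
                           vqa_score: VQAScorer,
                           threshold: float = 0.5) -> List[QAPair]:
    """Return the question-answer pairs the image fails to support."""
    return [(q, a) for q, a in generate_qa(text)
            if vqa_score(image, q, a) < threshold]


def rerank(candidates: list, text: str,
           generate_qa: QuestionGenerator,
           vqa_score: VQAScorer) -> list:
    """Sort candidate images from most to least aligned with the text."""
    return sorted(
        candidates,
        key=lambda img: alignment_score(img, text, generate_qa, vqa_score),
        reverse=True,
    )


# Toy stand-ins so the sketch is runnable: an "image" is a set of tags,
# each word of the text becomes one yes/no question.
def toy_generate_qa(text: str) -> List[QAPair]:
    return [(f"Is there a {word}?", "yes") for word in text.split()]


def toy_vqa_score(image, question: str, answer: str) -> float:
    word = question.removeprefix("Is there a ").removesuffix("?")
    return 1.0 if word in image else 0.0


if __name__ == "__main__":
    text = "cat hat"
    candidates = [{"dog"}, {"cat"}, {"cat", "hat"}]
    for img in rerank(candidates, text, toy_generate_qa, toy_vqa_score):
        print(sorted(img), alignment_score(img, text, toy_generate_qa, toy_vqa_score))
```

Keeping the question generator and VQA scorer as injected callables means real models could be swapped in without changing the scoring, localization, or re-ranking logic.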