What You See is What You Read? Improving Text-Image Alignment Evaluation

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
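
The first method above (question generation followed by visual question answering) can be illustrated with a minimal sketch. This is not the authors' implementation: the two callables `generate_qa` and `vqa_score`, the choice of a plain mean as the aggregate, and all function names are hypothetical and introduced only for illustration. The sketch shows how per-question scores could yield an overall alignment score, point to the least supported question (a rough form of localizing a misalignment), and re-rank candidate images.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# - a question generator maps a text to (question, expected_answer) pairs
# - a VQA scorer returns a confidence in [0, 1] that the expected answer
#   holds for the given image
QAPair = Tuple[str, str]
QuestionGenerator = Callable[[str], List[QAPair]]
VQAScorer = Callable[[Any, str, str], float]


def per_question_scores(
    image: Any, text: str, generate_qa: QuestionGenerator, vqa_score: VQAScorer
) -> List[Tuple[QAPair, float]]:
    """Score every generated question-answer pair against the image."""
    return [((q, a), vqa_score(image, q, a)) for q, a in generate_qa(text)]


def alignment_score(
    image: Any, text: str, generate_qa: QuestionGenerator, vqa_score: VQAScorer
) -> float:
    """Aggregate per-question scores; the mean is an assumed choice."""
    scored = per_question_scores(image, text, generate_qa, vqa_score)
    if not scored:
        return 0.0  # no questions could be generated from the text
    return sum(score for _, score in scored) / len(scored)


def least_supported(
    image: Any, text: str, generate_qa: QuestionGenerator, vqa_score: VQAScorer
) -> Optional[QAPair]:
    """Return the question-answer pair the image supports least, if any."""
    scored = per_question_scores(image, text, generate_qa, vqa_score)
    if not scored:
        return None
    return min(scored, key=lambda item: item[1])[0]


def rerank(
    images: Sequence[Any],
    text: str,
    generate_qa: QuestionGenerator,
    vqa_score: VQAScorer,
) -> List[Any]:
    """Order candidate images from most to least aligned with the text."""
    return sorted(
        images,
        key=lambda img: alignment_score(img, text, generate_qa, vqa_score),
        reverse=True,
    )
```

In practice the two callables would wrap real question-generation and VQA models; the second method in the abstract would instead replace this whole pipeline with a single finetuned classifier that maps an image-text pair directly to a score.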
