What You See is What You Read? Improving Text-Image Alignment Evaluation

Abstract

Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first is a pipeline based on question generation and visual question answering models, and the second is an end-to-end classification approach that fine-tunes multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.
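The re-ranking use mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes only that some alignment scorer returns a higher number for a better-aligned text-image pair. The names `rerank_candidates` and `toy_alignment_score` are hypothetical, and the toy scorer (word overlap against a set of image tags) is a stand-in for illustration, not either of the methods described in the paper.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def rerank_candidates(
    text: str,
    candidates: Sequence[Image],
    alignment_score: Callable[[str, Image], float],
) -> List[Tuple[Image, float]]:
    """Return (candidate, score) pairs ordered from best to worst aligned."""
    scored = [(image, alignment_score(text, image)) for image in candidates]
    # sorted() is stable, so tied candidates keep their generation order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_alignment_score(text: str, image_tags: frozenset) -> float:
    """Hypothetical scorer: fraction of prompt words found in the image's tags.

    A real system would call a learned alignment model here instead.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(word in image_tags for word in words) / len(words)


if __name__ == "__main__":
    # Each "image" is represented by a set of tags purely for illustration.
    generated = [
        frozenset({"dog"}),
        frozenset({"black", "dog", "sofa"}),
        frozenset({"cat"}),
    ]
    for image, score in rerank_candidates("black dog sofa", generated, toy_alignment_score):
        print(sorted(image), round(score, 2))
```

Because the scorer is passed in as a function, the same re-ranking loop would work with any alignment metric that maps a text-image pair to a number.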
