Simple Token-Level Confidence Improves Caption Correctness

Abstract

The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and outperforms prior state-of-the-art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art.
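The scoring step described in the abstract (feed the image and a proposed caption to the captioner, read off a confidence per token, then aggregate over a word or the whole sequence) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not say which algebraic measures or aggregations are used, so softmax probability under teacher forcing and min/mean/product aggregation are stand-ins, the function names are hypothetical, and the logits are toy numbers rather than outputs of a real model.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(step_logits: Sequence[Sequence[float]],
                      token_ids: Sequence[int]) -> List[float]:
    """Probability the model assigns to each token of the proposed caption.

    step_logits[t] is assumed to be the captioner's logits at position t when
    the image and the caption prefix are fed in (teacher forcing).
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per caption token")
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


def aggregate(confidences: Sequence[float], how: str = "min") -> float:
    """Collapse token confidences for a word or a caption into one score."""
    if not confidences:
        raise ValueError("no token confidences to aggregate")
    if how == "min":
        return min(confidences)
    if how == "mean":
        return sum(confidences) / len(confidences)
    if how == "prod":
        return math.prod(confidences)
    raise ValueError(f"unknown aggregation: {how}")


# Toy setup: a 3-token vocabulary and two captions that share the first token
# and differ only in the second, so both see the same per-step logits.
step_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
caption_a = [0, 1]  # second token is the one the model favors
caption_b = [0, 2]  # second token is one the model finds unlikely

score_a = aggregate(token_confidences(step_logits, caption_a), how="min")
score_b = aggregate(token_confidences(step_logits, caption_b), how="min")
print(f"caption A: {score_a:.3f}, caption B: {score_b:.3f}")
```

In this toy case the caption whose tokens all receive high probability gets the higher score. Taking the minimum makes a single low-confidence token dominate, which is one plausible way to catch a fine-grained error such as a wrong verb or a hallucinated object; whether that matches the aggregation used in the paper is not stated in the abstract.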
