Simple Token-Level Confidence Improves Caption Correctness
May 11, 2023
Authors: Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
cs.AI
Abstract
The ability to judge whether a caption correctly describes an image is a
critical part of vision-language understanding. However, state-of-the-art
models often misinterpret the correctness of fine-grained details, leading to
errors in outputs such as hallucinating objects in generated captions or poor
compositional reasoning. In this work, we explore Token-Level Confidence, or
TLC, as a simple yet surprisingly effective method to assess caption
correctness. Specifically, we fine-tune a vision-language model on image
captioning, input an image and proposed caption to the model, and aggregate
either algebraic or learned token confidences over words or sequences to
estimate image-caption consistency. Compared to sequence-level scores from
pretrained models, TLC with algebraic confidence measures achieves a relative
improvement in accuracy by 10% on verb understanding in SVO-Probes and
outperforms prior state-of-the-art in image and group scores for compositional
reasoning in Winoground by a relative 37% and 9%, respectively. When training
data are available, a learned confidence estimator provides further improved
performance, reducing object hallucination rates in MS COCO Captions by a
relative 30% over the original model and setting a new state-of-the-art.
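
To make the algebraic variant concrete, below is a minimal sketch of how per-token confidences might be aggregated into a single image-caption consistency score, following the abstract's description. The function name `tlc_algebraic`, the `min`/`mean` reduction choices, and the numeric values are illustrative assumptions, not the paper's exact formulation; the model call that would produce the log-probabilities (conditioning the caption on the image) is omitted and replaced with fixed values.

```python
# Sketch of algebraic token-level confidence (TLC-A) aggregation.
# Assumes we already have per-token log-probabilities of a proposed
# caption under an image-conditioned captioning model.
import math

def tlc_algebraic(token_logprobs, reduce="min"):
    """Aggregate per-token confidences into one consistency score.

    token_logprobs: list of log p(token_t | image, tokens_<t).
    reduce: "min" (weakest token dominates) or "mean" (average).
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    if reduce == "min":
        return min(probs)
    if reduce == "mean":
        return sum(probs) / len(probs)
    raise ValueError(f"unknown reduction: {reduce}")

# Hypothetical example: a caption whose verb token is low-confidence
# gets a low score under "min", even though the mean still looks fine.
caption_lps = [-0.1, -0.2, -2.5, -0.3]   # -2.5 corresponds to p ~ 0.08
print(tlc_algebraic(caption_lps, reduce="min"))   # ~0.08
print(tlc_algebraic(caption_lps, reduce="mean"))  # ~0.64
```

The contrast between the two reductions illustrates why token-level scores can catch fine-grained errors (a single hallucinated object or wrong verb) that a sequence-level average would smooth over.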