

Simple Token-Level Confidence Improves Caption Correctness

May 11, 2023
作者: Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
cs.AI

Abstract

The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misinterpret the correctness of fine-grained details, leading to errors in outputs such as hallucinating objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence, or TLC, as a simple yet surprisingly effective method to assess caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a relative improvement in accuracy by 10% on verb understanding in SVO-Probes and outperforms prior state-of-the-art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator provides further improved performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state-of-the-art.
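The core of the algebraic variant of TLC described above is straightforward: score a proposed caption by aggregating the per-token confidences a captioning model assigns to it, so that a single low-confidence token (e.g., a hallucinated object) can drag the score down. The sketch below illustrates this aggregation step only; the function name, the choice of aggregators, and the example log-probabilities are all illustrative assumptions, not the authors' code, and a real system would obtain token log-probabilities from a vision-language model fine-tuned on captioning.

```python
import math

def tlc_score(token_logprobs, method="min"):
    """Aggregate per-token confidences into a caption-level score.

    token_logprobs: log-probabilities a captioning model assigns to each
    token of the proposed caption, conditioned on the image (illustrative
    inputs here; in practice these come from the fine-tuned model).
    """
    probs = [math.exp(lp) for lp in token_logprobs]
    if method == "min":    # weakest token dominates the score
        return min(probs)
    if method == "mean":   # average token confidence
        return sum(probs) / len(probs)
    if method == "prod":   # product recovers a sequence-level likelihood
        return math.prod(probs)
    raise ValueError(f"unknown method: {method}")

# Hypothetical log-probs: a caption with one low-confidence token
# (a possibly hallucinated detail) scores lower under "min" than a
# uniformly confident caption, even if most tokens look fine.
consistent = [-0.1, -0.2, -0.1, -0.3]
suspect = [-0.1, -0.2, -3.5, -0.3]
assert tlc_score(consistent, "min") > tlc_score(suspect, "min")
```

The "min" aggregator makes the contrast with sequence-level scores concrete: a product over many tokens can wash out one unlikely word, while the minimum surfaces it directly.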