Simple Token-Level Confidence Improves Caption Correctness

Abstract

The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding. However, state-of-the-art models often misjudge the correctness of fine-grained details, leading to errors such as hallucinated objects in generated captions or poor compositional reasoning. In this work, we explore Token-Level Confidence (TLC) as a simple yet surprisingly effective method for assessing caption correctness. Specifically, we fine-tune a vision-language model on image captioning, input an image and a proposed caption to the model, and aggregate either algebraic or learned token confidences over words or sequences to estimate image-caption consistency. Compared to sequence-level scores from pretrained models, TLC with algebraic confidence measures achieves a 10% relative improvement in accuracy on verb understanding in SVO-Probes and outperforms the prior state of the art in image and group scores for compositional reasoning in Winoground by a relative 37% and 9%, respectively. When training data are available, a learned confidence estimator further improves performance, reducing object hallucination rates in MS COCO Captions by a relative 30% over the original model and setting a new state of the art.
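As a rough illustration of the aggregation step described above, the following minimal sketch computes one possible algebraic confidence per token and reduces it over word or sequence spans. The abstract does not say which algebraic measures or reductions are used, so the softmax probability of each caption token and the `min` reduction are assumptions made here for illustration, and the function names `softmax`, `token_confidences`, and `aggregate` are hypothetical rather than taken from the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(
    step_logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Assumed algebraic confidence: softmax probability of each caption token."""
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


def aggregate(
    confidences: Sequence[float],
    spans: Sequence[Tuple[int, int]],
    reduce: Callable[[Sequence[float]], float] = min,
) -> List[float]:
    """Reduce token confidences over each (start, end) span, end exclusive."""
    return [reduce(confidences[start:end]) for start, end in spans]


# Toy example: a 3-token caption scored against a 3-entry vocabulary.
toy_logits = [[2.0, 0.1, 0.1], [0.2, 0.3, 0.1], [0.1, 0.1, 3.0]]
toy_tokens = [0, 1, 2]
conf = token_confidences(toy_logits, toy_tokens)

# Word-level scores (first token is one word, the next two form another),
# followed by a single sequence-level score over the whole caption.
word_scores = aggregate(conf, [(0, 1), (1, 3)])
sequence_score = aggregate(conf, [(0, len(conf))])[0]
```

In this sketch a low word-level score would flag the word the model is least sure about, and the sequence-level score could be compared across candidate captions for the same image. In practice the logits would come from the fine-tuned captioning model run on the image and the proposed caption.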
