Advancing Reference-free Evaluation of Video Captions with Factual Analysis

Abstract

Video captions offer concise summaries of the actors, objects, and actions in a video, which makes them valuable for applications such as question answering and event localization. However, acquiring human-annotated video captions is costly and, for some diverse video domains, impractical. Models trained on supervised datasets are hard to evaluate across domains because reference-based evaluation protocols require ground-truth captions, an assumption that is unrealistic for videos in the wild. To address these limitations, we propose a reference-free evaluation framework that focuses on factual grounding to assess caption quality accurately. We introduce VC-Inspector, a novel caption quality evaluator that is both reference-free and factually grounded. Using large language models, we generate pseudo captions of varying quality from supervised data and then use them to train a multimodal model (i.e., Qwen2.5-VL) as the evaluator. On the VATEX-Eval dataset, our approach aligns more closely with human judgments than existing methods. The performance also generalizes to the image caption datasets Flickr8K-Expert and Flickr8K-CF when images are treated as 1-frame videos. Overall, VC-Inspector offers a scalable and generalizable solution for evaluating the factual accuracy of video captions, paving the way for more effective and objective assessment in diverse video domains.
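As a rough illustration of what alignment with human judgments can mean in practice, the sketch below computes a rank correlation between evaluator scores and human ratings. The abstract does not name the statistic used, so Kendall's tau-a is an assumption here, and the function name `kendall_tau` and all scores are hypothetical values chosen for illustration.

```python
from itertools import combinations


def kendall_tau(metric_scores, human_scores):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    This is an assumed choice of statistic, not one stated in the abstract.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two captions to rank")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # Positive product: both scorers order captions i and j the same way.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores for five captions (not taken from any dataset).
evaluator_scores = [4.5, 2.0, 3.5, 1.0, 5.0]
human_ratings = [4, 2, 3, 1, 5]
print(kendall_tau(evaluator_scores, human_ratings))
```

A value near 1 means the evaluator ranks captions in nearly the same order as the human raters, a value near -1 means the order is reversed, and a value near 0 means the two rankings are unrelated.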
