Advancing Reference-free Evaluation of Video Captions with Factual Analysis
September 20, 2025
Authors: Shubhashis Roy Dipta, Tz-Ying Wu, Subarna Tripathi
cs.AI
Abstract
Video captions offer concise snapshots of actors, objects, and actions within
a video, serving as valuable assets for applications such as question answering
and event localization. However, acquiring human annotations for video captions
is costly or even impractical, especially when dealing with diverse video
domains. Evaluating existing models, which are trained on supervised datasets,
across different domains is challenging because reference-based evaluation
protocols require ground-truth captions, an assumption that is unrealistic for
videos in the wild. To address these limitations, we propose a reference-free
evaluation framework that focuses on factual grounding to ensure accurate
assessment of caption quality. We introduce VC-Inspector, a novel
caption quality evaluator that is both reference-free and factually grounded.
Utilizing large language models, we generate pseudo captions of varying quality
based on supervised data, which are subsequently used to train a multimodal
model (i.e., Qwen2.5-VL) as the evaluator. Our approach demonstrates superior
alignment with human judgments on the VATEX-Eval dataset, outperforming
existing methods. The performance also generalizes to image caption datasets,
Flickr8K-Expert and Flickr8K-CF, when viewing images as 1-frame videos.
Overall, VC-Inspector offers a scalable and generalizable solution for
evaluating the factual accuracy of video captions, paving the way for more
effective and objective assessment methodologies in diverse video domains.
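
As a rough illustration of what "alignment with human judgments" can mean in practice, the sketch below computes a rank correlation between an evaluator's scores and human ratings for the same captions. The abstract does not say which statistic the authors report; Kendall's tau (the tau-a variant) is used here only as an assumed, commonly used choice, and the function name and all scores are hypothetical, made-up values.

```python
from itertools import combinations


def kendall_tau_a(metric_scores, human_scores):
    """Kendall's tau-a between two equally long lists of scores.

    Counts caption pairs that both lists rank in the same order (concordant)
    versus opposite order (discordant). Tied pairs count as neither.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two captions to rank")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells us whether both lists agree on the order.
        agreement = (metric_scores[i] - metric_scores[j]) * (
            human_scores[i] - human_scores[j]
        )
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: one evaluator score and one human rating per caption.
evaluator_scores = [4.5, 2.0, 3.5, 1.0]
human_ratings = [5, 3, 2, 1]

tau = kendall_tau_a(evaluator_scores, human_ratings)
print(f"Kendall tau-a: {tau:.3f}")
```

A value near 1 means the evaluator orders captions much as the human raters do, a value near 0 means no consistent relationship, and a negative value means the orderings tend to disagree.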