Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
March 24, 2025
Authors: Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, Xiaodan Liang
cs.AI
Abstract
Recent advancements in Large Video Language Models (LVLMs) have highlighted
their potential for multi-modal understanding, yet evaluating their factual
grounding in video contexts remains a critical unsolved challenge. To address
this gap, we introduce Video SimpleQA, the first comprehensive benchmark
tailored for factuality evaluation of LVLMs. Our work distinguishes itself from
existing video benchmarks through the following key features: 1) Knowledge
required: demanding integration of external knowledge beyond the explicit
narrative; 2) Fact-seeking question: targeting objective, undisputed events or
relationships, avoiding subjective interpretation; 3) Definitive & short-form
answer: Answers are crafted as unambiguous and definitively correct in a short
format, enabling automated evaluation through LLM-as-a-judge frameworks with
minimal scoring variance; 4) External-source verified: All annotations undergo
rigorous validation against authoritative external references to ensure
reliability; 5) Temporal reasoning required: The annotated question types
encompass both static single-frame understanding and dynamic temporal
reasoning, explicitly evaluating LVLMs' factuality under long-context
dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize
key findings as follows: 1) Current LVLMs exhibit notable deficiencies in
factual adherence, particularly for open-source models. The best-performing
model, Gemini-1.5-Pro, achieves merely an F-score of 54.4%; 2) Test-time compute
paradigms show insignificant performance gains, revealing fundamental
constraints for enhancing factuality through post-hoc computation; 3)
Retrieval-Augmented Generation demonstrates consistent improvements at the cost
of additional inference time overhead, presenting a critical
efficiency-performance trade-off.
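The abstract reports factuality with a single F-score under an LLM-as-a-judge grading scheme over short-form answers. As an illustration only, the sketch below assumes a SimpleQA-style convention in which each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy over attempted questions; the function name `f_score`, the grade labels, and this exact metric definition are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch of SimpleQA-style scoring (assumed convention, not the paper's code).
from collections import Counter


def f_score(grades: list[str]) -> float:
    """Harmonic mean of overall accuracy and accuracy over attempted questions."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    overall_correct = counts["correct"] / total if total else 0.0
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    denom = overall_correct + correct_given_attempted
    if denom == 0:
        return 0.0
    return 2 * overall_correct * correct_given_attempted / denom


# Example: 10 questions graded by an LLM judge — 5 correct, 3 incorrect, 2 not attempted.
grades = ["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2
print(round(f_score(grades), 3))  # 0.556
```

Under this assumed definition, declining to answer ("not attempted") hurts overall accuracy but not accuracy-given-attempted, so the harmonic mean rewards models that answer only when confident while still penalizing excessive abstention.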