Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models
March 24, 2025
Authors: Meng Cao, Pengfei Hu, Yingyao Wang, Jihao Gu, Haoran Tang, Haoze Zhao, Jiahua Dong, Wangbo Yu, Ge Zhang, Ian Reid, Xiaodan Liang
cs.AI
Abstract
Recent advancements in Large Video Language Models (LVLMs) have highlighted
their potential for multi-modal understanding, yet evaluating their factual
grounding in video contexts remains a critical unsolved challenge. To address
this gap, we introduce Video SimpleQA, the first comprehensive benchmark
tailored for factuality evaluation of LVLMs. Our work distinguishes itself from
existing video benchmarks through the following key features: 1) Knowledge
required: demanding integration of external knowledge beyond the explicit
narrative; 2) Fact-seeking question: targeting objective, undisputed events or
relationships, avoiding subjective interpretation; 3) Definitive & short-form
answer: Answers are crafted as unambiguous and definitively correct in a short
format, enabling automated evaluation through LLM-as-a-judge frameworks with
minimal scoring variance; 4) External-source verified: All annotations undergo
rigorous validation against authoritative external references to ensure
reliability; 5) Temporal reasoning required: The annotated question types
encompass both static single-frame understanding and dynamic temporal
reasoning, explicitly evaluating LVLMs' factuality under long-context
dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize
key findings as follows: 1) Current LVLMs exhibit notable deficiencies in
factual adherence, particularly for open-source models. The best-performing
model, Gemini-1.5-Pro, achieves merely an F-score of 54.4%; 2) Test-time compute
paradigms show insignificant performance gains, revealing fundamental
constraints for enhancing factuality through post-hoc computation; 3)
Retrieval-Augmented Generation demonstrates consistent improvements at the cost
of additional inference time overhead, presenting a critical
efficiency-performance trade-off.
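The abstract reports factuality with a single F-score under an LLM-as-a-judge grading scheme over short-form answers. As an illustration only, the sketch below assumes a SimpleQA-style convention in which each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy over attempted questions; the function name `f_score`, the grade labels, and this exact metric definition are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch of SimpleQA-style scoring (assumed convention, not the paper's code).
from collections import Counter


def f_score(grades: list[str]) -> float:
    """Harmonic mean of overall accuracy and accuracy over attempted questions."""
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    overall_correct = counts["correct"] / total if total else 0.0
    correct_given_attempted = counts["correct"] / attempted if attempted else 0.0
    denom = overall_correct + correct_given_attempted
    if denom == 0:
        return 0.0
    return 2 * overall_correct * correct_given_attempted / denom


# Example: 10 questions graded by an LLM judge — 5 correct, 3 incorrect, 2 not attempted.
grades = ["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2
print(round(f_score(grades), 3))  # 0.556
```

Under this assumed definition, declining to answer ("not attempted") hurts overall accuracy but not accuracy-given-attempted, so the harmonic mean rewards models that answer only when confident while still penalizing excessive abstention.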