Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models

Abstract

Recent advances in Large Video Language Models (LVLMs) have highlighted their potential for multi-modal understanding, yet evaluating their factual grounding in video contexts remains a critical unsolved challenge. To address this gap, we introduce Video SimpleQA, the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work differs from existing video benchmarks in the following key features: 1) Knowledge required: questions demand the integration of external knowledge beyond the explicit narrative; 2) Fact-seeking question: questions target objective, undisputed events or relationships and avoid subjective interpretation; 3) Definitive & short-form answer: answers are crafted to be unambiguous and definitively correct in a short format, enabling automated evaluation through LLM-as-a-judge frameworks with minimal scoring variance; 4) External-source verified: all annotations undergo rigorous validation against authoritative external references to ensure reliability; 5) Temporal reasoning required: the annotated question types encompass both static single-frame understanding and dynamic temporal reasoning, explicitly evaluating LVLMs' factuality under long-context dependencies. We extensively evaluate 41 state-of-the-art LVLMs and summarize the key findings as follows: 1) Current LVLMs exhibit notable deficiencies in factual adherence, particularly open-source models. Even the best-performing model, Gemini-1.5-Pro, achieves an F-score of only 54.4%; 2) Test-time compute paradigms yield insignificant performance gains, revealing fundamental constraints on enhancing factuality through post-hoc computation; 3) Retrieval-Augmented Generation delivers consistent improvements at the cost of additional inference-time overhead, presenting a critical efficiency-performance trade-off.
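The abstract reports an F-score computed from LLM-as-a-judge grades but does not define it. The sketch below shows one plausible way such a score could be computed. It assumes the convention used by the original SimpleQA benchmark: each answer is graded as correct, incorrect, or not attempted, and the F-score is the harmonic mean of overall accuracy and accuracy on attempted questions. The label names and the function `f_score` are hypothetical and are not taken from this paper.

```python
from collections import Counter

# Assumed grading labels (SimpleQA-style); not specified in the abstract.
VALID_GRADES = {"correct", "incorrect", "not_attempted"}


def f_score(grades):
    """Harmonic mean of overall accuracy and accuracy on attempted questions.

    `grades` is an iterable of judge labels, one per question.
    """
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")

    total = sum(counts.values())
    if total == 0:
        return 0.0

    correct = counts["correct"]
    attempted = correct + counts["incorrect"]

    # Fraction of all questions answered correctly.
    overall_correct = correct / total
    # Fraction of attempted questions answered correctly.
    correct_given_attempted = correct / attempted if attempted else 0.0

    denom = overall_correct + correct_given_attempted
    if denom == 0:
        return 0.0
    return 2 * overall_correct * correct_given_attempted / denom


if __name__ == "__main__":
    # Toy grades for illustration only; these are not results from the paper.
    example = ["correct"] * 5 + ["incorrect"] * 3 + ["not_attempted"] * 2
    print(f"F-score: {f_score(example):.3f}")
```

Under this assumed definition, a model that declines to answer when unsure is penalized less than one that answers incorrectly, which is why the score combines both accuracy terms instead of reporting plain accuracy.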
