VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

Abstract

Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a sobering lesson about existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated because the correct answer can be guessed. Second, a significant portion of the questions in these benchmarks carry strong priors that allow models to answer directly without even watching the input video. For example, Gemini-1.5-Pro can achieve over 50% accuracy on Video-MME when given only a random frame from a long video. We also observe that, counterintuitively, increasing the number of frames does not necessarily improve performance on existing benchmarks. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark of open-ended short-answer questions that truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we arrive at the following findings: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared with MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this domain.
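
The guessing effect mentioned in the abstract can be made concrete with a small sketch. This is not taken from the paper: it applies the standard correction-for-guessing arithmetic under the assumption that a model answers some fraction of questions from genuine understanding and guesses uniformly among the options otherwise. The function names and the choice of four options are illustrative assumptions.

```python
def observed_mcq_accuracy(p_known: float, num_options: int) -> float:
    """Expected MCQ accuracy when every unknown answer is guessed uniformly.

    p_known: assumed fraction of questions the model genuinely answers.
    num_options: assumed number of answer choices per question.
    """
    return p_known + (1.0 - p_known) / num_options


def chance_corrected_accuracy(observed: float, num_options: int) -> float:
    """Invert the formula above to estimate the fraction genuinely known."""
    chance = 1.0 / num_options
    # Clamp at zero: scores below chance carry no evidence of knowledge.
    return max(0.0, (observed - chance) / (1.0 - chance))


if __name__ == "__main__":
    k = 4  # assumed number of options; real benchmarks may differ
    for p in (0.0, 0.25, 0.5):
        acc = observed_mcq_accuracy(p, k)
        print(f"known={p:.2f}  observed MCQ accuracy={acc:.4f}")
```

Under these assumptions, a model that genuinely knows a quarter of the answers would still score roughly 44% on four-option MCQs, while an open-ended short-answer format gives no comparable credit for guessing.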
