

VideoEval-Pro: Robust and Realistic Long Video Understanding Evaluation

May 20, 2025
Authors: Wentao Ma, Weiming Ren, Yiming Jia, Zhuofeng Li, Ping Nie, Ge Zhang, Wenhu Chen
cs.AI

Abstract

Large multimodal models (LMMs) have recently emerged as a powerful tool for long video understanding (LVU), prompting the development of standardized LVU benchmarks to evaluate their performance. However, our investigation reveals a rather sober lesson for existing LVU benchmarks. First, most existing benchmarks rely heavily on multiple-choice questions (MCQs), whose evaluation results are inflated by the possibility of guessing the correct answer. Second, a significant portion of the questions in these benchmarks carry strong priors that allow models to answer them directly, without even watching the input video. For example, Gemini-1.5-Pro achieves over 50% accuracy on Video-MME when given only a single random frame from a long video. We also observe that increasing the number of input frames does not necessarily improve performance on existing benchmarks, which is counterintuitive. As a result, the validity and robustness of current LVU benchmarks are undermined, impeding a faithful assessment of LMMs' long-video understanding capability. To tackle this problem, we propose VideoEval-Pro, a realistic LVU benchmark consisting of open-ended short-answer questions that truly require understanding the entire video. VideoEval-Pro assesses both segment-level and full-video understanding through perception and reasoning tasks. By evaluating 21 proprietary and open-source video LMMs, we draw the following conclusions: (1) video LMMs show drastic performance drops (>25%) on open-ended questions compared with MCQs; (2) surprisingly, higher MCQ scores do not lead to higher open-ended scores on VideoEval-Pro; (3) compared to other MCQ benchmarks, VideoEval-Pro benefits more from increasing the number of input frames. Our results show that VideoEval-Pro offers a more realistic and reliable measure of long video understanding, providing a clearer view of progress in this field.
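
To make the "strong prior" failure mode above concrete, the sketch below illustrates one way such a single-frame probe could be run: sample one random frame from each long video and ask the model the original multiple-choice question; a high accuracy suggests the question is answerable from priors alone rather than from watching the video. This is a minimal illustrative sketch, not the paper's released evaluation code: OpenCV is assumed for frame extraction, and `query_model` is a hypothetical placeholder for whichever video LMM API (e.g. Gemini-1.5-Pro) is being probed.

```python
# Hypothetical sketch of a single-random-frame prior probe for MCQ benchmarks.
# Assumes OpenCV for frame extraction; `query_model` is a placeholder and must
# be wired to an actual video LMM API by the user.
import random
import cv2


def sample_random_frame(video_path: str):
    """Return one randomly chosen frame (BGR numpy array) from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, random.randrange(max(total, 1)))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return frame


def query_model(frame, question: str, options: list[str]) -> str:
    """Placeholder: send a single frame plus an MCQ to the model, return its choice."""
    raise NotImplementedError("Connect this to the video LMM being evaluated.")


def single_frame_mcq_accuracy(samples: list[dict]) -> float:
    """Accuracy when the model sees only one random frame instead of the full video.

    Each sample is expected to look like:
        {"video": path, "question": str, "options": [...], "answer": "B"}
    A score far above chance indicates the question carries a strong prior.
    """
    correct = 0
    for s in samples:
        frame = sample_random_frame(s["video"])
        prediction = query_model(frame, s["question"], s["options"])
        correct += prediction.strip() == s["answer"]
    return correct / len(samples)
```

Under this setup, comparing single-frame accuracy against full-video accuracy (and against chance level) gives a simple read on how much of a benchmark's score is attributable to priors rather than genuine long-video understanding.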
