

TVBench: Redesigning Video-Language Evaluation

October 10, 2024
Authors: Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, Yuki M. Asano
cs.AI

Abstract

Large language models have demonstrated impressive performance when integrated with vision models, even enabling video understanding. However, evaluating these video models presents its own unique challenges, for which several benchmarks have been proposed. In this paper, we show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks; (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input; (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues, while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative. As a solution, we propose TVBench, a novel open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluations that it requires a high level of temporal understanding. Surprisingly, we find that most recent state-of-the-art video-language models perform similarly to random performance on TVBench, with only Gemini-Pro and Tarsier clearly surpassing this baseline.
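As a rough illustration of the shortcut checks the abstract describes, the sketch below contrasts a random-guess baseline with a text-only baseline (issue ii: answering from the question and candidates alone, without the video) on a toy multiple-choice set. It is a minimal sketch under assumptions: the data schema (question, candidates, answer_idx) and the answer_fn stand-in are hypothetical and do not reflect TVBench's actual format or evaluation code.

```python
import random

# Hypothetical examples in the style of a multiple-choice video QA benchmark.
# Field names (question, candidates, answer_idx) are assumptions, not TVBench's schema.
samples = [
    {"question": "What happens first in the video?",
     "candidates": ["The door opens", "The light turns off", "The cup falls"],
     "answer_idx": 2},
    {"question": "In which direction does the object move?",
     "candidates": ["Left", "Right", "Up"],
     "answer_idx": 0},
]

def random_baseline(samples, trials=10_000, seed=0):
    """Accuracy of picking a candidate uniformly at random.

    For multiple-choice QA this converges to 1/num_candidates per question,
    i.e. the 'random performance' the abstract compares models against."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = rng.choice(samples)
        hits += rng.randrange(len(s["candidates"])) == s["answer_idx"]
    return hits / trials

def text_only_accuracy(samples, answer_fn):
    """Accuracy of a model that sees only the question and candidates,
    never the video. A score far above the random baseline suggests the
    benchmark leaks answers through its text (issue ii above)."""
    hits = sum(
        answer_fn(s["question"], s["candidates"]) == s["answer_idx"]
        for s in samples
    )
    return hits / len(samples)

if __name__ == "__main__":
    print(f"random baseline: {random_baseline(samples):.2f}")
    # answer_fn would wrap an actual LLM call; a trivial stand-in is used here.
    print(f"text-only: {text_only_accuracy(samples, lambda q, c: 0):.2f}")
```

A benchmark that resists this shortcut should hold text-only accuracy close to the random baseline; the abstract's claim is that most current video-language models score near that same baseline on TVBench even with full video input.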