TVBench: Redesigning Video-Language Evaluation
October 10, 2024
Authors: Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, Yuki M. Asano
cs.AI
Abstract
Large language models have demonstrated impressive performance when
integrated with vision models, even enabling video understanding. However,
evaluating these video models presents its own unique challenges, for which
several benchmarks have been proposed. In this paper, we show that the most
widely used video-language benchmarks can be solved without requiring much
temporal reasoning. We identify three main issues in existing datasets:
(i) static information from single frames is often sufficient to solve the
tasks; (ii) the text of the questions and candidate answers is overly
informative, allowing models to answer correctly without relying on any
visual input; and (iii) world knowledge alone can answer many of the
questions, making the benchmarks a test of knowledge replication rather than
visual reasoning. In addition, we find that open-ended question-answering
benchmarks for video understanding suffer from similar issues, while
automatic evaluation with LLMs is unreliable, making them an unsuitable
alternative. As a solution, we
propose TVBench, a novel open-source video multiple-choice question-answering
benchmark, and demonstrate through extensive evaluations that it requires a
high level of temporal understanding. Surprisingly, we find that most recent
state-of-the-art video-language models perform close to the random baseline
on TVBench, with only Gemini-Pro and Tarsier clearly surpassing it.