MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
June 20, 2024
作者: Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen
cs.AI
Abstract
The advent of large vision-language models (LVLMs) has spurred research into
their applications in multi-modal contexts, particularly in video
understanding. Traditional VideoQA benchmarks, despite providing quantitative
metrics, often fail to encompass the full spectrum of video content and
inadequately assess models' temporal comprehension. To address these
limitations, we introduce MMBench-Video, a quantitative benchmark designed to
rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video
incorporates lengthy videos from YouTube and employs free-form questions,
mirroring practical use cases. The benchmark is meticulously crafted to probe
the models' temporal reasoning skills, with all questions human-annotated
according to a carefully constructed ability taxonomy. We employ GPT-4 for
automated assessment, demonstrating superior accuracy and robustness over
earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted
comprehensive evaluations that include both proprietary and open-source LVLMs
for images and videos. MMBench-Video stands as a valuable resource for the
research community, facilitating improved evaluation of LVLMs and catalyzing
progress in the field of video understanding. The evaluation code of
MMBench-Video will be integrated into VLMEvalKit:
https://github.com/open-compass/VLMEvalKit.
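
The abstract notes that GPT-4 is used as an automated judge for free-form answers. The snippet below is a minimal sketch of how such LLM-based grading might look in practice; the prompt wording, the 0-3 score scale, and the `grade_answer` helper are illustrative assumptions, not the exact rubric or code used by MMBench-Video or VLMEvalKit.

```python
# Minimal sketch of LLM-as-judge scoring for free-form VideoQA answers.
# Prompt wording, score scale, and function names are illustrative assumptions,
# not the exact rubric used by MMBench-Video.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def grade_answer(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4 to rate a model's free-form answer against the reference (0-3)."""
    prompt = (
        "You are grading a video question-answering response.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 0 (wrong) to 3 (fully correct). "
        "Reply with a single integer."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())


# Example usage:
# score = grade_answer("What does the chef add last?", "Basil", "He adds basil at the end.")
```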