MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
June 20, 2024
作者: Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, Kai Chen
cs.AI
Abstract
The advent of large vision-language models (LVLMs) has spurred research into
their applications in multi-modal contexts, particularly in video
understanding. Traditional VideoQA benchmarks, despite providing quantitative
metrics, often fail to encompass the full spectrum of video content and
inadequately assess models' temporal comprehension. To address these
limitations, we introduce MMBench-Video, a quantitative benchmark designed to
rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video
incorporates lengthy videos from YouTube and employs free-form questions,
mirroring practical use cases. The benchmark is meticulously crafted to probe
the models' temporal reasoning skills, with all questions human-annotated
according to a carefully constructed ability taxonomy. We employ GPT-4 for
automated assessment, demonstrating superior accuracy and robustness over
earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted
comprehensive evaluations that include both proprietary and open-source LVLMs
for images and videos. MMBench-Video stands as a valuable resource for the
research community, facilitating improved evaluation of LVLMs and catalyzing
progress in the field of video understanding. The evaluation code of
MMBench-Video will be integrated into VLMEvalKit:
https://github.com/open-compass/VLMEvalKit.
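
The abstract notes that GPT-4 is used as an automated judge for free-form answers. The snippet below is a minimal sketch of how such LLM-based grading might look in practice; the prompt wording, the 0-3 score scale, and the `grade_answer` helper are illustrative assumptions, not the exact rubric or code used by MMBench-Video or VLMEvalKit.

```python
# Minimal sketch of LLM-as-judge scoring for free-form VideoQA answers.
# Prompt wording, score scale, and function names are illustrative assumptions,
# not the exact rubric used by MMBench-Video.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def grade_answer(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4 to rate a model's free-form answer against the reference (0-3)."""
    prompt = (
        "You are grading a video question-answering response.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 0 (wrong) to 3 (fully correct). "
        "Reply with a single integer."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())


# Example usage:
# score = grade_answer("What does the chef add last?", "Basil", "He adds basil at the end.")
```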