MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Abstract

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.
