MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
October 20, 2025
Authors: Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu
cs.AI
Abstract
The recent development of Multimodal Large Language Models (MLLMs) has
significantly advanced AI's ability to understand visual modalities. However,
existing evaluation benchmarks remain limited to single-turn question
answering, overlooking the complexity of multi-turn dialogues in real-world
scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video
understanding benchmark for evaluating MLLMs in multi-turn dialogues.
Specifically, our MT-Video-Bench mainly assesses six core competencies that
focus on perceptivity and interactivity, encompassing 987 meticulously curated
multi-turn dialogues from diverse domains. These capabilities are rigorously
aligned with real-world applications, such as interactive sports analysis and
multi-turn video-based intelligent tutoring. With MT-Video-Bench, we
extensively evaluate various state-of-the-art open-source and closed-source
MLLMs, revealing their significant performance discrepancies and limitations in
handling multi-turn video dialogues. The benchmark will be publicly available
to foster future research.