MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues
October 20, 2025
Authors: Yaning Pan, Zekun Wang, Qianqian Xie, Yongqian Wen, Yuanxing Zhang, Guohui Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Tianhao Peng, Jiaheng Liu
cs.AI
Abstract
The recent development of Multimodal Large Language Models (MLLMs) has
significantly advanced AI's ability to understand visual modalities. However,
existing evaluation benchmarks remain limited to single-turn question
answering, overlooking the complexity of multi-turn dialogues in real-world
scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video
understanding benchmark for evaluating MLLMs in multi-turn dialogues.
Specifically, MT-Video-Bench primarily assesses six core competencies that
focus on perceptivity and interactivity, across 987 meticulously curated
multi-turn dialogues from diverse domains. These capabilities are rigorously
aligned with real-world applications, such as interactive sports analysis and
multi-turn video-based intelligent tutoring. With MT-Video-Bench, we
extensively evaluate various state-of-the-art open-source and closed-source
MLLMs, revealing their significant performance discrepancies and limitations in
handling multi-turn video dialogues. The benchmark will be publicly available
to foster future research.