
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

May 15, 2025
作者: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
cs.AI

Abstract

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
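The abstract's length control rests on one mechanism: each example's effective length is counted over both modalities, with every image contributing vision-patch tokens alongside the text tokens, and the total is standardized to one of five buckets (8K-128K). A minimal sketch of that accounting, assuming a fixed per-image patch count; the function names, the patch count of 576, and the exact bucket boundaries are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch of cross-modal length bucketing: total length is
# text tokens plus vision patches, then snapped to the smallest of the
# five standardized buckets (8K-128K). All names and numbers here are
# illustrative, not taken from the MMLongBench implementation.

LENGTH_BUCKETS = [8_192, 16_384, 32_768, 65_536, 131_072]  # 8K .. 128K

def cross_modal_length(num_text_tokens: int, image_patch_counts: list[int]) -> int:
    """Combined length: text tokens + vision patches over all images."""
    return num_text_tokens + sum(image_patch_counts)

def assign_bucket(total_tokens: int) -> int:
    """Smallest standardized bucket that fits the example; examples
    longer than the largest bucket would need truncation to 128K."""
    for bucket in LENGTH_BUCKETS:
        if total_tokens <= bucket:
            return bucket
    return LENGTH_BUCKETS[-1]

# e.g. 5,000 text tokens plus two images at 576 patches each
total = cross_modal_length(5_000, [576, 576])   # 6,152 tokens
bucket = assign_bucket(total)                    # falls in the 8K bucket
```

Counting both modalities in one budget is what lets "hundreds of images with interleaved text tokens" be compared fairly across models: two examples in the same bucket consume a comparable share of the context window regardless of their image-to-text ratio.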
