

MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly

May 15, 2025
Authors: Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
cs.AI

Abstract

The rapid extension of context windows in large vision-language models has given rise to long-context vision-language models (LCVLMs), which are capable of handling hundreds of images with interleaved text tokens in a single forward pass. In this work, we introduce MMLongBench, the first benchmark covering a diverse set of long-context vision-language tasks, to evaluate LCVLMs effectively and thoroughly. MMLongBench is composed of 13,331 examples spanning five different categories of downstream tasks, such as Visual RAG and Many-Shot ICL. It also provides broad coverage of image types, including various natural and synthetic images. To assess the robustness of the models to different input lengths, all examples are delivered at five standardized input lengths (8K-128K tokens) via a cross-modal tokenization scheme that combines vision patches and text tokens. Through a thorough benchmarking of 46 closed-source and open-source LCVLMs, we provide a comprehensive analysis of the current models' vision-language long-context ability. Our results show that: i) performance on a single task is a weak proxy for overall long-context capability; ii) both closed-source and open-source models face challenges in long-context vision-language tasks, indicating substantial room for future improvement; iii) models with stronger reasoning ability tend to exhibit better long-context performance. By offering wide task coverage, various image types, and rigorous length control, MMLongBench provides the missing foundation for diagnosing and advancing the next generation of LCVLMs.
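To make the length-control idea concrete, below is a minimal, hypothetical sketch of cross-modal length budgeting: each image is counted as a number of vision-patch tokens, text is counted with a rough tokenizer estimate, and interleaved items are packed until a target budget (e.g., 8K-128K tokens) is reached. The function names, the 14x14 patch accounting, and the character-based text estimate are assumptions for illustration, not the paper's exact tokenization scheme.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

PATCH = 14  # assumed ViT-style patch size (illustrative, not from the paper)

@dataclass
class ImageItem:
    width: int
    height: int

@dataclass
class TextItem:
    text: str

Item = Union[ImageItem, TextItem]

def image_tokens(img: ImageItem, patch: int = PATCH) -> int:
    """Approximate vision tokens as the number of patches covering the image."""
    cols = -(-img.width // patch)   # ceiling division
    rows = -(-img.height // patch)
    return cols * rows

def text_tokens(item: TextItem) -> int:
    """Crude text-token estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(item.text) // 4)

def pack_to_budget(items: List[Item], budget: int) -> Tuple[List[Item], int]:
    """Greedily keep interleaved image/text items until the token budget is filled."""
    kept, used = [], 0
    for it in items:
        cost = image_tokens(it) if isinstance(it, ImageItem) else text_tokens(it)
        if used + cost > budget:
            break
        kept.append(it)
        used += cost
    return kept, used

if __name__ == "__main__":
    context: List[Item] = [
        TextItem("Question about the following documents."),
        ImageItem(896, 1344),
        TextItem("Page 1 caption " * 50),
        ImageItem(672, 896),
    ]
    for budget in (8_000, 16_000):
        kept, used = pack_to_budget(context, budget)
        print(f"budget={budget}: kept {len(kept)} items, ~{used} tokens")
```

In this toy accounting, a 896x1344 image costs 64x96 = 6,144 patch tokens, so image-heavy contexts fill an 8K budget after only a few items, which is why standardizing lengths across modalities requires counting patches and text tokens under one shared budget.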

