

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

June 17, 2024
作者: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios, such as following instructions in a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find relevant images and textual descriptions from open-source Wikipedia and construct the question-answer pairs with human annotators assisted by the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly narrows this gap, generating longer and more accurate conversations and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
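The abstract describes the MMDU-45k construction pipeline only at a high level: cluster related images and textual descriptions from open-source Wikipedia, then have human annotators build question-answer pairs with GPT-4o assistance. The sketch below illustrates one plausible form of the clustering step; the encoder (`all-MiniLM-L6-v2`), the choice of k-means, and all function names are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the clustering step mentioned in the abstract:
# group Wikipedia images (via their captions) into related sets that
# could seed a multi-image dialog. The embedding model and k-means
# choice are illustrative assumptions, not the paper's method.
from sentence_transformers import SentenceTransformer  # assumed encoder library
from sklearn.cluster import KMeans

def cluster_image_descriptions(captions: list[str], n_clusters: int = 50) -> dict[int, list[str]]:
    """Embed textual descriptions and cluster them so that images whose
    captions fall in the same cluster are treated as related."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical model choice
    embeddings = model.encode(captions, normalize_embeddings=True)
    kmeans = KMeans(n_clusters=min(n_clusters, len(captions)),
                    n_init="auto", random_state=0)
    labels = kmeans.fit_predict(embeddings)
    clusters: dict[int, list[str]] = {}
    for caption, label in zip(captions, labels):
        clusters.setdefault(int(label), []).append(caption)
    return clusters

# Captions grouped into one cluster could then be combined into a single
# multi-image prompt for GPT-4o-assisted question-answer construction.
```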
