
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

June 17, 2024
作者: Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions over a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find relevant images and textual descriptions from open-source Wikipedia and construct the question-answer pairs with human annotators assisted by the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly narrows this gap, generating longer and more accurate conversations and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
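The abstract describes building the dataset by clustering related images and textual descriptions from Wikipedia before annotators, aided by GPT-4o, write question-answer pairs over each group. As a rough, hypothetical sketch of that grouping step (the abstract does not name the embedding model or clustering algorithm; the `all-MiniLM-L6-v2` sentence encoder and KMeans below are illustrative assumptions, not the paper's actual pipeline):

```python
# Hypothetical sketch: group Wikipedia descriptions by embedding similarity
# so that related entries (and their images) can seed one multi-image dialogue.
# Embedding model and clustering method are assumptions for illustration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

descriptions = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "The Louvre is the world's most-visited art museum.",
    "Mount Fuji is the highest mountain in Japan.",
    "Lake Kawaguchi offers classic views of Mount Fuji.",
]

# Encode each textual description into a shared vector space.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(descriptions)

# Cluster the embeddings; entries in the same cluster are treated as
# mutually relevant context for a multi-turn, multi-image conversation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)

for label, text in sorted(zip(labels, descriptions)):
    print(label, text)
```

Entries that land in the same cluster would then form the pool of related images and descriptions from which a single long-context dialogue is constructed.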
