MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

October 14, 2024
Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
cs.AI

Abstract

Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.
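The abstract describes a mixed evaluation protocol: exact matching for multiple-choice items and a fine-tuned scoring model for open-ended, interleaved responses. The sketch below illustrates what such an evaluation loop could look like in outline. It is a hypothetical illustration, not the MMIE implementation: the field names (`category`, `question_type`, `correct_choice`, `reference_answer`), the `generate` callable, and the trivial token-overlap stand-in for the paper's fine-tuned scoring model are all assumptions made purely to keep the example self-contained and runnable.

```python
# Hypothetical sketch of an MMIE-style evaluation loop.
# Field names and the scoring function are illustrative assumptions,
# not taken from the released benchmark code.
import json
from statistics import mean


def score_open_ended(model_output: str, reference: str) -> float:
    """Placeholder for the paper's fine-tuned scoring model.
    A trivial token-overlap proxy keeps this sketch runnable."""
    out, ref = set(model_output.lower().split()), set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)


def evaluate(queries: list[dict], generate) -> dict:
    """Run a model (the `generate` callable) over queries and
    aggregate mean scores per category."""
    per_category: dict[str, list[float]] = {}
    for q in queries:
        answer = generate(q["interleaved_input"])
        if q["question_type"] == "multiple_choice":
            # Exact match against the gold option.
            score = float(answer.strip() == q["correct_choice"])
        else:
            # Open-ended: defer to the (stand-in) scoring model.
            score = score_open_ended(answer, q["reference_answer"])
        per_category.setdefault(q["category"], []).append(score)
    return {cat: mean(scores) for cat, scores in per_category.items()}


if __name__ == "__main__":
    demo = [
        {"category": "mathematics", "question_type": "multiple_choice",
         "interleaved_input": "Which option equals 2+2? (A) 3 (B) 4",
         "correct_choice": "B"},
        {"category": "literature", "question_type": "open_ended",
         "interleaved_input": "Describe the scene and its mood.",
         "reference_answer": "a quiet seaside town at dusk"},
    ]
    dummy_model = lambda prompt: "B" if "option" in prompt else "a quiet town at dusk"
    print(json.dumps(evaluate(demo, dummy_model), indent=2))
```

In the actual benchmark, the multiple-choice and open-ended branches would be backed by the released scoring model and systematic criteria rather than the toy overlap metric used here.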
