

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

October 14, 2024
Authors: Peng Xia, Siwei Han, Shi Qiu, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui, Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao
cs.AI

Abstract

Interleaved multimodal comprehension and generation, enabling models to produce and interpret both images and text in arbitrary sequences, have become a pivotal area in multimodal learning. Despite significant advancements, the evaluation of this capability remains insufficient. Existing benchmarks suffer from limitations in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased, lacking reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric, leveraging a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, aimed at reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advancements in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.
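
To make the evaluation setup concrete, the sketch below shows one way an interleaved query with a mix of multiple-choice and open-ended items could be represented and scored by a fine-tuned scoring model. This is a minimal, self-contained illustration assuming a hypothetical schema; the class and function names (Segment, InterleavedQuery, generate_answer, score_response) are not the official MMIE API, and the real benchmark uses its released data format and scoring model.

```python
from dataclasses import dataclass, field
from typing import List, Literal

# Hypothetical structures illustrating an MMIE-style interleaved query;
# field names are assumptions for illustration, not the official schema.

@dataclass
class Segment:
    kind: Literal["text", "image"]   # interleaved content alternates freely
    content: str                     # raw text, or a path/URL to an image

@dataclass
class InterleavedQuery:
    question: List[Segment]                                  # interleaved prompt
    choices: List[str] = field(default_factory=list)         # empty for open-ended items
    reference: List[Segment] = field(default_factory=list)   # gold interleaved answer

def generate_answer(model_name: str, query: InterleavedQuery) -> List[Segment]:
    """Placeholder for an LVLM call that returns an interleaved answer."""
    return [Segment("text", f"[{model_name} answer to: {query.question[0].content}]")]

def score_response(response: List[Segment], query: InterleavedQuery) -> float:
    """Placeholder for the fine-tuned scoring model (0-1 quality score).
    In MMIE this role is played by a model tuned on human-annotated rubrics."""
    return 0.5  # dummy score for the sketch

if __name__ == "__main__":
    queries = [
        InterleavedQuery(
            question=[Segment("text", "Explain the physics shown in the diagram."),
                      Segment("image", "diagrams/incline_plane.png")],
        ),
    ]
    scores = [score_response(generate_answer("demo-lvlm", q), q) for q in queries]
    print(f"Mean score over {len(queries)} queries: {sum(scores) / len(scores):.2f}")
```

In practice, generate_answer would call the LVLM under evaluation and score_response would invoke the benchmark's fine-tuned scorer; averaging per-query scores yields the kind of model-level comparison reported for the eight evaluated LVLMs.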
