MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Abstract

Interleaved multimodal comprehension and generation, which enables models to produce and interpret both images and text in arbitrary sequences, has become a pivotal area in multimodal learning. Despite significant advances, evaluation of this capability remains insufficient. Existing benchmarks are limited in data scale, scope, and evaluation depth, while current evaluation metrics are often costly or biased and lack reliability for practical applications. To address these challenges, we introduce MMIE, a large-scale knowledge-intensive benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). MMIE comprises 20K meticulously curated multimodal queries spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts. It supports both interleaved inputs and outputs and offers a mix of multiple-choice and open-ended question formats to evaluate diverse competencies. Moreover, we propose a reliable automated evaluation metric that leverages a scoring model fine-tuned with human-annotated data and systematic evaluation criteria, with the aim of reducing bias and improving evaluation accuracy. Extensive experiments demonstrate the effectiveness of our benchmark and metrics in providing a comprehensive evaluation of interleaved LVLMs. Specifically, we evaluate eight LVLMs and find that even the best models show significant room for improvement, with most achieving only moderate results. We believe MMIE will drive further advances in the development of interleaved LVLMs. We publicly release our benchmark and code at https://mmie-bench.github.io/.
