MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Abstract

Existing MLLM benchmarks face significant challenges in evaluating Unified MLLMs (U-MLLMs) for two reasons: 1) the lack of standardized benchmarks for traditional tasks, which leads to inconsistent comparisons; 2) the absence of benchmarks for mixed-modality generation, which leaves multimodal reasoning capabilities unassessed. We present a comprehensive evaluation framework designed to systematically assess U-MLLMs. Our benchmark includes: 1. Standardized Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with 30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified Task Assessment. We introduce five novel tasks testing multimodal reasoning, including image editing, commonsense QA with image generation, and geometric reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs, such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized understanding models (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3). Our findings reveal substantial performance gaps in existing U-MLLMs, highlighting the need for more robust models capable of handling mixed-modality tasks effectively. The code and evaluation data are available at https://mme-unify.github.io/.

