MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models
April 4, 2025
Authors: Wulin Xie, Yi-Fan Zhang, Chaoyou Fu, Yang Shi, Bingyan Nie, Hongkai Chen, Zhang Zhang, Liang Wang, Tieniu Tan
cs.AI
Abstract
Existing MLLM benchmarks face significant challenges in evaluating Unified
MLLMs (U-MLLMs) due to: 1) lack of standardized benchmarks for traditional
tasks, leading to inconsistent comparisons; 2) absence of benchmarks for
mixed-modality generation, which fails to assess multimodal reasoning
capabilities. We present a comprehensive evaluation framework designed to
systematically assess U-MLLMs. Our benchmark includes: 1. Standardized
Traditional Task Evaluation. We sample from 12 datasets, covering 10 tasks with
30 subtasks, ensuring consistent and fair comparisons across studies. 2. Unified
Task Assessment. We introduce five novel tasks testing multimodal reasoning,
including image editing, commonsense QA with image generation, and geometric
reasoning. 3. Comprehensive Model Benchmarking. We evaluate 12 leading U-MLLMs,
such as Janus-Pro, EMU3, VILA-U, and Gemini2-flash, alongside specialized
understanding (e.g., Claude-3.5-Sonnet) and generation models (e.g., DALL-E-3).
Our findings reveal substantial performance gaps in existing U-MLLMs,
highlighting the need for more robust models capable of handling mixed-modality
tasks effectively. The code and evaluation data can be found in
https://mme-unify.github.io/.