MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
May 26, 2025
作者: Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo
cs.AI
Abstract
Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and
Gemini 2.5 Pro excel at following complex instructions, editing images, and
maintaining concept consistency. However, they are still evaluated by disjoint
toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning,
and customized image generation benchmarks that overlook compositional
semantics and common knowledge. We propose MMIG-Bench, a comprehensive
Multi-Modal Image Generation Benchmark that unifies these tasks by pairing
4,850 richly annotated text prompts with 1,750 multi-view reference images
across 380 subjects, spanning humans, animals, objects, and artistic styles.
MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level
metrics for visual artifacts and identity preservation of objects; (2) a novel
Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers
fine-grained prompt-image alignment and shows strong correlation with human
judgments; and (3) high-level metrics for aesthetics and human preference.
Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5
Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human
ratings, yielding in-depth insights into architecture and data design. We will
release the dataset and evaluation code to foster rigorous, unified evaluation
and accelerate future innovations in multi-modal image generation.
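The abstract describes AMS only at a high level: a VQA model checks fine-grained aspects of the prompt against the generated image. A minimal sketch of that general idea is below; the question decomposition and the `vqa_model` callable are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a VQA-based aspect-matching score, in the spirit of
# the abstract's AMS. The real MMIG-Bench metric is not specified here; this
# simply scores the fraction of aspect questions a VQA model answers "yes" to.
from typing import Callable, List


def aspect_matching_score(
    questions: List[str],
    image: object,
    vqa_model: Callable[[object, str], str],
) -> float:
    """Return the fraction of aspect questions answered affirmatively.

    `questions` are yes/no questions derived from the prompt (e.g. one per
    object, attribute, or relation); `vqa_model` is any callable that maps
    (image, question) to an answer string.
    """
    if not questions:
        raise ValueError("need at least one aspect question")
    hits = sum(
        1 for q in questions if vqa_model(image, q).strip().lower() == "yes"
    )
    return hits / len(questions)
```

With a prompt like "a black cat on a red sofa", the questions might be "Is there a cat?", "Is the cat black?", "Is there a sofa?", and so on; the score is then the proportion the VQA model confirms, which is what makes the metric fine-grained and interpretable per aspect.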