MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models
May 26, 2025
作者: Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo
cs.AI
Abstract
Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and
Gemini 2.5 Pro excel at following complex instructions, editing images, and
maintaining concept consistency. However, they are still evaluated by disjoint
toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning,
and customized image generation benchmarks that overlook compositional
semantics and common knowledge. We propose MMIG-Bench, a comprehensive
Multi-Modal Image Generation Benchmark that unifies these tasks by pairing
4,850 richly annotated text prompts with 1,750 multi-view reference images
across 380 subjects, spanning humans, animals, objects, and artistic styles.
MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level
metrics for visual artifacts and identity preservation of objects; (2) a novel
Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers
fine-grained prompt-image alignment and shows strong correlation with human
judgments; and (3) high-level metrics for aesthetics and human preference.
Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5
Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human
ratings, yielding in-depth insights into architecture and data design. We will
release the dataset and evaluation code to foster rigorous, unified evaluation
and accelerate future innovations in multi-modal image generation.
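The abstract describes AMS only at a high level: a VQA model checks fine-grained aspects of the prompt against the generated image. A minimal sketch of that general idea is below; the question decomposition and the `vqa_model` callable are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a VQA-based aspect-matching score, in the spirit of
# the abstract's AMS. The real MMIG-Bench metric is not specified here; this
# simply scores the fraction of aspect questions a VQA model answers "yes" to.
from typing import Callable, List


def aspect_matching_score(
    questions: List[str],
    image: object,
    vqa_model: Callable[[object, str], str],
) -> float:
    """Return the fraction of aspect questions answered affirmatively.

    `questions` are yes/no questions derived from the prompt (e.g. one per
    object, attribute, or relation); `vqa_model` is any callable that maps
    (image, question) to an answer string.
    """
    if not questions:
        raise ValueError("need at least one aspect question")
    hits = sum(
        1 for q in questions if vqa_model(image, q).strip().lower() == "yes"
    )
    return hits / len(questions)
```

With a prompt like "a black cat on a red sofa", the questions might be "Is there a cat?", "Is the cat black?", "Is there a sofa?", and so on; the score is then the proportion the VQA model confirms, which is what makes the metric fine-grained and interpretable per aspect.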