

MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

May 26, 2025
Authors: Hang Hua, Ziyun Zeng, Yizhi Song, Yunlong Tang, Liu He, Daniel Aliaga, Wei Xiong, Jiebo Luo
cs.AI

Abstract

Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) a novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.
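The Aspect Matching Score (AMS) is only described at a high level in the abstract. As a rough sketch of how a VQA-based aspect-matching metric of this general kind could be computed (not the paper's released implementation), the Python snippet below scores an image as the fraction of aspect-level yes/no questions, derived from the prompt, that a VQA model answers as expected. The function name, the vqa callable, and the example questions are illustrative assumptions, not the authors' code.

from typing import Callable, Iterable, Tuple


def aspect_matching_score(
    image: object,
    aspect_questions: Iterable[Tuple[str, str]],
    vqa: Callable[[object, str], str],
) -> float:
    """Fraction of aspect-level questions whose VQA answer matches the expected answer.

    aspect_questions: (question, expected_answer) pairs, one per semantic aspect
        extracted from the text prompt (entity, attribute, relation, count, ...).
    vqa: any visual question answering callable mapping (image, question) -> answer.
    NOTE: this is an illustrative sketch of a VQA-based alignment score, not the
    official MMIG-Bench AMS implementation.
    """
    questions = list(aspect_questions)
    if not questions:
        return 0.0
    matched = sum(
        1
        for question, expected in questions
        if vqa(image, question).strip().lower() == expected.strip().lower()
    )
    return matched / len(questions)


if __name__ == "__main__":
    # Toy stand-in for a real VQA model: answers "yes" to every question.
    def dummy_vqa(image, question):
        return "yes"

    # Hypothetical aspect questions for the prompt "a red cube on a wooden table".
    questions = [
        ("Is there a cube in the image?", "yes"),
        ("Is the cube red?", "yes"),
        ("Is the cube on a wooden table?", "yes"),
    ]
    print(aspect_matching_score("image.png", questions, dummy_vqa))  # prints 1.0

In practice, the question set would be generated automatically from the benchmark's prompt annotations (entities, attributes, relations, counts), and the vqa argument would wrap an actual vision-language model rather than the dummy answerer used here.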
