MMMG: A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Abstract

Automatically evaluating multimodal generation is challenging because automated metrics often fail to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive, human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio). MMMG focuses on tasks that pose significant challenges for generation models while still enabling reliable automatic evaluation through a combination of models and programs. It encompasses 49 tasks (29 of them newly developed), each with a carefully designed evaluation pipeline, and 937 instructions that systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that although the state-of-the-art model, GPT Image, achieves 78.3% accuracy on image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, the results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.
