MMMG: A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
May 23, 2025
Authors: Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu
cs.AI
Abstract
Automatically evaluating multimodal generation presents a significant
challenge, as automated metrics often struggle to align reliably with human
evaluation, especially for complex tasks that involve multiple modalities. To
address this, we present MMMG, a comprehensive and human-aligned benchmark for
multimodal generation across 4 modality combinations (image, audio, interleaved
text and image, interleaved text and audio), with a focus on tasks that present
significant challenges for generation models, while still enabling reliable
automatic evaluation through a combination of models and programs. MMMG
encompasses 49 tasks (including 29 newly developed ones), each with a carefully
designed evaluation pipeline, and 937 instructions to systematically assess
reasoning, controllability, and other key capabilities of multimodal generation
models. Extensive validation demonstrates that MMMG is highly aligned with
human evaluation, achieving an average agreement of 94.3%. Benchmarking results
on 24 multimodal generation models reveal that even though the state-of-the-art
model, GPT Image, achieves 78.3% accuracy for image generation, it falls short
on multimodal reasoning and interleaved generation. Furthermore, results
suggest considerable headroom for improvement in audio generation, highlighting
an important direction for future research.