MMMG: A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Abstract

Automatically evaluating multimodal generation is difficult because automated metrics often fail to align reliably with human evaluation, especially on complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive, human-aligned benchmark for multimodal generation across four modality combinations (image, audio, interleaved text and image, interleaved text and audio). It focuses on tasks that remain significantly challenging for generation models while still permitting reliable automatic evaluation through a combination of models and programs. MMMG comprises 49 tasks (29 of them newly developed), each with a carefully designed evaluation pipeline, and 937 instructions that systematically assess the reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation shows that MMMG is highly aligned with human evaluation, with an average agreement of 94.3%. Benchmarking 24 multimodal generation models reveals that although the state-of-the-art model, GPT Image, achieves 78.3% accuracy on image generation, it falls short on multimodal reasoning and interleaved generation. The results also indicate considerable headroom for improvement in audio generation, highlighting an important direction for future research.
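The abstract reports an average agreement with human evaluation but does not say how that average is formed. As a minimal sketch, assuming each automatic verdict and each human verdict is a binary pass/fail label and that the average is taken over tasks (a macro-average), the computation might look like the following. The function name, the record layout, and the task names are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def average_agreement(records):
    """Compute per-task and macro-averaged agreement.

    records: iterable of (task, auto_label, human_label) tuples.
    Assumption: labels are directly comparable (e.g. booleans), and
    the overall figure is the unweighted mean of per-task rates.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, auto_label, human_label in records:
        totals[task] += 1
        hits[task] += int(auto_label == human_label)
    if not totals:
        raise ValueError("no records to evaluate")
    # Fraction of samples in each task where the two verdicts match.
    per_task = {task: hits[task] / totals[task] for task in totals}
    # Unweighted mean over tasks, so small tasks count as much as large ones.
    macro = sum(per_task.values()) / len(per_task)
    return per_task, macro


# Hypothetical toy data: two tasks with two judged samples each.
records = [
    ("object_count", True, True),
    ("object_count", False, True),
    ("speech_speed", True, True),
    ("speech_speed", False, False),
]
per_task, macro = average_agreement(records)
print(per_task, macro)
```

A sample-weighted (micro) average would instead divide the total number of matches by the total number of samples; the two coincide only when every task has the same number of judged samples.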