MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
May 23, 2025
Authors: Jihan Yao, Yushi Hu, Yujie Yi, Bin Han, Shangbin Feng, Guang Yang, Bingbing Wen, Ranjay Krishna, Lucy Lu Wang, Yulia Tsvetkov, Noah A. Smith, Banghua Zhu
cs.AI
Abstract
Automatically evaluating multimodal generation presents a significant
challenge, as automated metrics often struggle to align reliably with human
evaluation, especially for complex tasks that involve multiple modalities. To
address this, we present MMMG, a comprehensive and human-aligned benchmark for
multimodal generation across 4 modality combinations (image, audio, interleaved
text and image, interleaved text and audio), with a focus on tasks that present
significant challenges for generation models, while still enabling reliable
automatic evaluation through a combination of models and programs. MMMG
encompasses 49 tasks (including 29 newly developed ones), each with a carefully
designed evaluation pipeline, and 937 instructions to systematically assess
reasoning, controllability, and other key capabilities of multimodal generation
models. Extensive validation demonstrates that MMMG is highly aligned with
human evaluation, achieving an average agreement of 94.3%. Benchmarking results
on 24 multimodal generation models reveal that even though the state-of-the-art
model, GPT Image, achieves 78.3% accuracy for image generation, it falls short
on multimodal reasoning and interleaved generation. Furthermore, results
suggest considerable headroom for improvement in audio generation, highlighting
an important direction for future research.