MMMG: A Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

Abstract

Automatically evaluating multimodal generation is a significant challenge, because automated metrics often fail to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across four modality combinations (image, audio, interleaved text and image, and interleaved text and audio). MMMG focuses on tasks that are difficult for generation models while still enabling reliable automatic evaluation through a combination of models and programs. MMMG comprises 49 tasks (29 of them newly developed), each with a carefully designed evaluation pipeline, and 937 instructions that systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation shows that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models show that the state-of-the-art model, GPT Image, achieves 78.3% accuracy on image generation but falls short on multimodal reasoning and interleaved generation. The results also suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.