MMIG-Bench: Towards Comprehensive and Explainable Evaluation of Multi-Modal Image Generation Models

Abstract

Recent multimodal image generators such as GPT-4o, Gemini 2.0 Flash, and Gemini 2.5 Pro excel at following complex instructions, editing images, and maintaining concept consistency. However, they are still evaluated by disjoint toolkits: text-to-image (T2I) benchmarks that lack multi-modal conditioning, and customized image generation benchmarks that overlook compositional semantics and common knowledge. We propose MMIG-Bench, a comprehensive Multi-Modal Image Generation Benchmark that unifies these tasks by pairing 4,850 richly annotated text prompts with 1,750 multi-view reference images across 380 subjects, spanning humans, animals, objects, and artistic styles. MMIG-Bench is equipped with a three-level evaluation framework: (1) low-level metrics for visual artifacts and identity preservation of objects; (2) the novel Aspect Matching Score (AMS), a VQA-based mid-level metric that delivers fine-grained prompt-image alignment and shows strong correlation with human judgments; and (3) high-level metrics for aesthetics and human preference. Using MMIG-Bench, we benchmark 17 state-of-the-art models, including Gemini 2.5 Pro, FLUX, DreamBooth, and IP-Adapter, and validate our metrics with 32k human ratings, yielding in-depth insights into architecture and data design. We will release the dataset and evaluation code to foster rigorous, unified evaluation and accelerate future innovations in multi-modal image generation.
