UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

Abstract

Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks for evaluating how accurately generated images reflect the semantics of their text prompts. However, existing benchmarks (1) lack the diversity of prompt scenarios and the multilingual support that real-world applicability requires, and (2) offer only coarse evaluation along primary dimensions, covering a narrow range of sub-dimensions and falling short in fine-grained assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. It comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) the prompts span diverse real-world scenarios, namely 5 main prompt themes and 20 subthemes; (2) they comprehensively probe the semantic consistency of T2I models over 10 primary and 27 sub evaluation criteria, with each prompt assessing multiple test points. To rigorously assess model robustness to variations in language and prompt length, we provide English and Chinese versions of each prompt in both short and long forms. Leveraging the general world knowledge and fine-grained image understanding of a closed-source Multimodal Large Language Model (MLLM), Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. Moreover, to further facilitate community use, we train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
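To make the hierarchical organization concrete, the following is a minimal sketch of how one benchmark entry and its per-criterion scores might be represented. It is not the released data format or the scoring rule used in this work: the record layout, the names `EvalPoint`, `PromptRecord`, and `aggregate`, the placeholder criterion labels, and the choice to average sub-criterion accuracies into a primary score are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# The abstract states that each prompt comes in English and Chinese,
# in short and long forms. The tuple keys below are a hypothetical encoding.
LANGUAGES = ("en", "zh")
LENGTHS = ("short", "long")


@dataclass(frozen=True)
class EvalPoint:
    """One test point: a (primary criterion, sub-criterion) pair (hypothetical)."""
    primary: str
    sub: str


@dataclass
class PromptRecord:
    """One benchmark entry (hypothetical layout, not the released schema)."""
    prompt_id: int
    theme: str       # one of the 5 main themes
    subtheme: str    # one of the 20 subthemes
    variants: dict   # (language, length) -> prompt text
    points: list     # EvalPoint items this prompt assesses


def aggregate(judgments):
    """Turn (EvalPoint, passed) pairs into per-sub and per-primary accuracy.

    Assumption: a primary score is the unweighted mean of its sub scores.
    """
    hits = defaultdict(list)
    for point, passed in judgments:
        hits[(point.primary, point.sub)].append(bool(passed))
    sub_acc = {key: sum(v) / len(v) for key, v in hits.items()}

    grouped = defaultdict(list)
    for (primary, _sub), acc in sub_acc.items():
        grouped[primary].append(acc)
    primary_acc = {p: sum(a) / len(a) for p, a in grouped.items()}
    return sub_acc, primary_acc


# Placeholder entry; the texts and labels are illustrative, not benchmark data.
record = PromptRecord(
    prompt_id=0,
    theme="theme-A",
    subtheme="subtheme-A1",
    variants={(lang, size): f"<{lang} {size} prompt text>"
              for lang in LANGUAGES for size in LENGTHS},
    points=[EvalPoint("P1", "S1"), EvalPoint("P1", "S2"), EvalPoint("P2", "S3")],
)

# Hypothetical pass/fail judgments from an evaluator for some generated images.
judgments = [
    (EvalPoint("P1", "S1"), True),
    (EvalPoint("P1", "S1"), False),
    (EvalPoint("P1", "S2"), True),
    (EvalPoint("P2", "S3"), False),
]
sub_acc, primary_acc = aggregate(judgments)
print(sub_acc)
print(primary_acc)
```

Under this layout, each of the 600 prompts would carry four text variants (two languages times two lengths), so robustness to language and prompt length could be compared by running the same aggregation separately per variant.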
