ImagenHub: Standardizing the evaluation of conditional image generation models

Abstract

Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, and control-guided image generation. However, we observe huge inconsistencies in experimental conditions (datasets, inference, and evaluation metrics) that render fair comparisons difficult. This paper proposes ImagenHub, a one-stop library to standardize the inference and evaluation of all conditional image generation models. First, we define seven prominent tasks and curate high-quality evaluation datasets for them. Second, we build a unified inference pipeline to ensure fair comparison. Third, we design two human evaluation scores, Semantic Consistency and Perceptual Quality, along with comprehensive guidelines for evaluating generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves high inter-worker agreement, with Krippendorff's alpha exceeding 0.4 on 76% of the models. We comprehensively evaluate around 30 models in total and observe three key takeaways: (1) the performance of existing models is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% of the models achieving an overall score lower than 0.5; (2) we examined the claims from published papers and found that 83% of them hold, with a few exceptions; (3) none of the existing automatic metrics has a Spearman's correlation higher than 0.2, except for subject-driven image generation. Moving forward, we will continue to evaluate newly published models and update our leaderboard to keep track of progress in conditional image generation.
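To make the third takeaway concrete, the following is a minimal sketch of how a Spearman correlation between human scores and an automatic metric could be computed. It is not taken from the ImagenHub codebase: the function names (`average_ranks`, `pearson`, `spearman`) and the score lists are hypothetical, and the numbers are invented purely for illustration.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-image scores (invented for illustration only).
human_scores = [0.9, 0.2, 0.5, 0.7, 0.1, 0.4]         # e.g. human overall score
metric_scores = [0.31, 0.28, 0.35, 0.25, 0.30, 0.27]  # e.g. an automatic metric

rho = spearman(human_scores, metric_scores)
print(f"Spearman rho = {rho:.3f}")
print("above 0.2" if rho > 0.2 else "not above 0.2")
```

A value near 1 would mean the automatic metric ranks images much as the human raters do, while a value near 0 would mean the two rankings are essentially unrelated. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of hand-written ranking code.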
