ImagenHub: Standardizing the evaluation of conditional image generation models
October 2, 2023
Authors: Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen
cs.AI
Abstract
Recently, a myriad of conditional image generation and editing models have
been developed to serve different downstream tasks, including text-to-image
generation, text-guided image editing, subject-driven image generation,
control-guided image generation, etc. However, we observe huge inconsistencies
in experimental conditions: differences in datasets, inference settings, and
evaluation metrics render fair comparisons difficult. This paper proposes
ImagenHub, a one-stop library to standardize the inference and evaluation of all the
conditional image generation models. Firstly, we define seven prominent tasks
and curate high-quality evaluation datasets for them. Secondly, we build a
unified inference pipeline to ensure fair comparison. Thirdly, we design two
human evaluation scores, i.e. Semantic Consistency and Perceptual Quality,
along with comprehensive guidelines to evaluate generated images. We train
expert raters to evaluate the model outputs based on the proposed metrics. Our
human evaluation achieves high inter-worker agreement, with a Krippendorff's
alpha above 0.4 on 76% of the models. We comprehensively evaluated a
total of around 30 models and observed three key takeaways: (1) the existing
models' performance is generally unsatisfying except for Text-guided Image
Generation and Subject-driven Image Generation, with 74% of models achieving an
overall score lower than 0.5. (2) We examined the claims from published papers
and found that 83% of them hold, with a few exceptions. (3) None of the existing
automatic metrics achieves a Spearman's correlation higher than 0.2 except for
Subject-driven Image Generation. Moving forward, we will continue our efforts
to evaluate newly published models and update our leaderboard to keep track of
the progress in conditional image generation.
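The correlation analysis mentioned above (automatic metrics vs. human ratings) can be illustrated with a small sketch. The following computes Spearman's rank correlation in pure Python; the `human` and `metric` scores are invented for illustration and are not data from the paper.

```python
def rank(values):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative (invented) per-model scores: human overall ratings
# vs. an automatic metric's scores for five hypothetical models.
human = [0.8, 0.3, 0.5, 0.9, 0.2]
metric = [0.7, 0.4, 0.4, 0.8, 0.1]
print(f"Spearman's rho: {spearman(human, metric):.3f}")
```

A rho near 1.0 means the automatic metric ranks models the same way humans do; the paper's finding that most metrics stay below 0.2 indicates the rankings largely disagree.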