Holistic Evaluation of Text-to-Image Models

Abstract

The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. For full transparency, we release the generated images and human evaluation results at https://crfm.stanford.edu/heim/v1.1.0 and the code, which is integrated with the HELM codebase, at https://github.com/stanford-crfm/helm.
