Holistic Evaluation of Text-to-Image Models

Abstract

The stunning qualitative improvement of recent text-to-image models has drawn widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. For full transparency, we release the generated images and human evaluation results at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.
