Holistic Evaluation of Text-to-Image Models

Abstract

The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.
