Holistic Evaluation of Text-to-Image Models

Abstract

The stunning qualitative improvement of recent text-to-image models has brought them widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects: text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. For full transparency, we release the generated images and human evaluation results at https://crfm.stanford.edu/heim/v1.1.0 and the code, which is integrated with the HELM codebase, at https://github.com/stanford-crfm/helm.
