Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
April 25, 2024
作者: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh
cs.AI
Abstract
While text-to-image (T2I) generative models have become ubiquitous, they do
not necessarily generate images that align with a given prompt. Although previous
work has evaluated T2I alignment by proposing metrics, benchmarks, and
templates for collecting human judgements, the quality of these components is
not systematically measured. Human-rated prompt sets are generally small and
the reliability of the ratings -- and thereby the prompt set used to compare
models -- is not evaluated. We address this gap by performing an extensive
study evaluating auto-eval metrics and human templates. We provide three main
contributions: (1) We introduce a comprehensive skills-based benchmark that can
discriminate models across different human templates. This skills-based
benchmark categorises prompts into sub-skills, allowing a practitioner to
pinpoint not only which skills are challenging, but at what level of complexity
a skill becomes challenging. (2) We gather human ratings across four templates
and four T2I models for a total of >100K annotations. This allows us to
understand where differences arise due to inherent ambiguity in the prompt and
where they arise due to differences in metric and model quality. (3) Finally,
we introduce a new QA-based auto-eval metric that is better correlated with
human ratings than existing metrics for our new dataset, across different human
templates, and on TIFA160.
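To make contribution (3) concrete, below is a minimal, hypothetical sketch of how a QA-based alignment metric of the kind described above can be scored and compared against human ratings. The `generate_qa` and `answer_vqa` callables are stand-in assumptions (an LLM question generator and a VQA model), not the paper's actual Gecko components, and the aggregation shown (fraction of correctly answered questions, Spearman correlation with human ratings) is a common recipe in this metric family rather than necessarily the exact one used.

```python
from dataclasses import dataclass
from typing import Callable, List

from scipy.stats import spearmanr  # for correlating metric scores with human ratings


@dataclass
class QAPair:
    """A question derived from the prompt and its expected ground-truth answer."""
    question: str  # e.g. "Is there a cat in the image?"
    answer: str    # e.g. "yes"


def qa_alignment_score(
    prompt: str,
    image: object,
    generate_qa: Callable[[str], List[QAPair]],  # hypothetical: LLM turning a prompt into QA pairs
    answer_vqa: Callable[[object, str], str],    # hypothetical: VQA model answering a question about an image
) -> float:
    """Score an image as the fraction of prompt-derived questions it answers correctly."""
    qa_pairs = generate_qa(prompt)
    if not qa_pairs:
        return 0.0
    correct = sum(
        answer_vqa(image, qa.question).strip().lower() == qa.answer.strip().lower()
        for qa in qa_pairs
    )
    return correct / len(qa_pairs)


def correlation_with_humans(metric_scores: List[float], human_ratings: List[float]) -> float:
    """Spearman rank correlation, a standard way to compare an auto-eval metric to human judgements."""
    rho, _p_value = spearmanr(metric_scores, human_ratings)
    return rho
```

Metrics in this family differ mainly in how the questions are generated from the prompt and in how VQA answers are scored (e.g., exact string match versus model-based answer scoring); the skeleton above only fixes the overall generate-answer-aggregate pipeline.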