Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
April 25, 2024
Authors: Olivia Wiles, Chuhan Zhang, Isabela Albuquerque, Ivana Kajić, Su Wang, Emanuele Bugliarello, Yasumasa Onoe, Chris Knutsen, Cyrus Rashtchian, Jordi Pont-Tuset, Aida Nematzadeh
cs.AI
Abstract
While text-to-image (T2I) generative models have become ubiquitous, they do
not necessarily generate images that align with a given prompt. While previous
work has evaluated T2I alignment by proposing metrics, benchmarks, and
templates for collecting human judgements, the quality of these components is
not systematically measured. Human-rated prompt sets are generally small and
the reliability of the ratings -- and thereby the prompt set used to compare
models -- is not evaluated. We address this gap by performing an extensive
study evaluating auto-eval metrics and human templates. We provide three main
contributions: (1) We introduce a comprehensive skills-based benchmark that can
discriminate models across different human templates. This skills-based
benchmark categorises prompts into sub-skills, allowing a practitioner to
pinpoint not only which skills are challenging, but at what level of complexity
a skill becomes challenging. (2) We gather human ratings across four templates
and four T2I models for a total of >100K annotations. This allows us to
understand where differences arise due to inherent ambiguity in the prompt and
where they arise due to differences in metric and model quality. (3) Finally,
we introduce a new QA-based auto-eval metric that is better correlated with
human ratings than existing metrics for our new dataset, across different human
templates, and on TIFA160.
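
To make contribution (1) concrete, here is a minimal sketch of what a skills-based prompt set could look like in code. The skill names, sub-skill buckets, and complexity field below are illustrative stand-ins, not the actual Gecko taxonomy; the point is that tagging each prompt this way lets a practitioner localise failures (e.g. "counting degrades above a certain number of objects") rather than only attributing them to a broad skill.

```python
# Illustrative structure for a skills-based T2I benchmark. Each prompt is
# tagged with a skill, a sub-skill, and a complexity level, so per-prompt
# alignment scores (human ratings or an auto-eval metric) can be aggregated
# into fine-grained buckets. The taxonomy here is made up for the sketch.
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class BenchmarkPrompt:
    text: str
    skill: str        # e.g. "counting", "spatial relations", "text rendering"
    sub_skill: str    # finer-grained bucket within the skill
    complexity: int   # e.g. number of objects, relations, or words involved

def score_by_sub_skill(prompts: list[BenchmarkPrompt],
                       scores: list[float]) -> dict:
    """Mean alignment score per (skill, sub_skill, complexity) bucket,
    revealing at what complexity level a skill becomes challenging."""
    buckets = defaultdict(list)
    for prompt, score in zip(prompts, scores):
        buckets[(prompt.skill, prompt.sub_skill, prompt.complexity)].append(score)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}
```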
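For contribution (3), the sketch below shows the general shape of a QA-based alignment metric: decompose the prompt into question-answer pairs, answer each question against the generated image with a VQA model, and report the fraction answered as expected, then meta-evaluate the metric by rank-correlating its scores with human ratings. `generate_qa_pairs` and `vqa_answer` are hypothetical stand-ins for an LLM question generator and a VQA model; this is not the released Gecko implementation.

```python
# Sketch of a QA-based auto-eval metric for T2I alignment, plus the
# Spearman correlation typically used to compare metrics against humans.
from scipy.stats import spearmanr

def generate_qa_pairs(prompt: str) -> list[tuple[str, str]]:
    """Hypothetical: use an LLM to turn the prompt into (question,
    expected_answer) pairs covering objects, attributes, and relations."""
    raise NotImplementedError("plug in an LLM-based question generator")

def vqa_answer(image, question: str) -> str:
    """Hypothetical: query a VQA model about the generated image."""
    raise NotImplementedError("plug in a VQA model")

def qa_alignment_score(image, prompt: str) -> float:
    """Fraction of prompt-derived questions answered as expected;
    1.0 means every checked element of the prompt appears in the image."""
    pairs = generate_qa_pairs(prompt)
    correct = sum(
        vqa_answer(image, question).strip().lower() == answer.strip().lower()
        for question, answer in pairs
    )
    return correct / len(pairs) if pairs else 0.0

def correlation_with_humans(metric_scores: list[float],
                            human_scores: list[float]) -> float:
    """Meta-evaluation: Spearman rank correlation between per-prompt
    metric scores and mean human ratings for the same prompts."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return float(rho)
```

A binary exact-match over answers is only one aggregation choice; a soft variant could instead average the VQA model's probability of the expected answer, which tends to be less brittle to paraphrased answers.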