A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation
October 25, 2023
Authors: Eyal Segalis, Dani Valevski, Danny Lumen, Yossi Matias, Yaniv Leviathan
cs.AI
Abstract
Text-to-image diffusion models achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to precisely follow all of the directions in their prompts. The vast majority of these models are trained on datasets consisting of (image, caption) pairs where the images often come from the web, and the captions are their HTML alternate text. A notable example is the LAION dataset, used by Stable Diffusion and other models. In this work we observe that these captions are often of low quality, and argue that this significantly affects the model's capability to understand nuanced semantics in the textual prompts. We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board. First, in overall image quality: e.g. FID 14.84 vs. the baseline of 17.87, and 64.3% improvement in faithful image generation according to human evaluation. Second, in semantic alignment, e.g. semantic object accuracy 84.34 vs. 78.90, counting alignment errors 1.32 vs. 1.44, and positional alignment 62.42 vs. 57.60. We analyze various ways to relabel the corpus and provide evidence that this technique, which we call RECAP, both reduces the train-inference discrepancy and provides the model with more information per example, increasing sample efficiency and allowing the model to better understand the relations between captions and images.
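To make the recaptioning step concrete, the minimal sketch below relabels a folder of images with an off-the-shelf captioning model from Hugging Face transformers (BLIP) and writes out new (image, caption) pairs. The choice of BLIP, the directory layout, and the output file are illustrative assumptions only; they are not the fine-tuned captioner or training pipeline used in the paper.

```python
# Illustrative recaptioning pass: replace noisy web alt-text captions with
# captions produced by an automatic image-captioning model.
# NOTE: the captioner (BLIP) and all paths below are assumptions for
# illustration, not the model or data layout used in the RECAP paper.
import json
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
).to(device)

image_dir = Path("laion_subset/images")          # hypothetical local image dump
out_path = Path("laion_subset/recaptioned.jsonl")  # hypothetical output file

with out_path.open("w") as f:
    for image_path in sorted(image_dir.glob("*.jpg")):
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        with torch.no_grad():
            output_ids = captioner.generate(**inputs, max_new_tokens=64)
        caption = processor.decode(output_ids[0], skip_special_tokens=True)
        # Each (image, new_caption) pair would replace the original alt-text
        # pair when fine-tuning the text-to-image model on the recaptioned corpus.
        f.write(json.dumps({"image": image_path.name, "caption": caption}) + "\n")
```

The resulting JSONL can then stand in for the original alt-text captions during text-to-image training; the paper's analysis compares several such relabeling strategies rather than prescribing this particular captioner.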