Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
September 27, 2023
Authors: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh
cs.AI
Abstract
Training text-to-image models on web-scale image-text pairs enables the
generation of a wide range of visual concepts from text. However, these
pre-trained models often struggle to generate highly aesthetic images,
which creates the need for aesthetic alignment after pre-training. In this
paper, we propose quality-tuning to effectively guide a pre-trained model
to exclusively generate highly visually appealing images, while maintaining
generality across visual concepts. Our key insight is that supervised
fine-tuning with a surprisingly small set of extremely visually appealing
images can significantly improve generation quality. We pre-train a latent
diffusion model on 1.1 billion image-text pairs and fine-tune it with only
a few thousand carefully selected high-quality images. The resulting model,
Emu, achieves a win rate of 82.9% against its pre-trained-only counterpart.
Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual
appeal 68.4% of the time on the standard PartiPrompts benchmark and 71.3%
of the time on our Open User Input benchmark, which is based on real-world
usage of text-to-image models. In addition, we show that quality-tuning is
a generic approach that is also effective for other architectures,
including pixel diffusion and masked generative transformer models.
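
To make the quality-tuning idea concrete, the sketch below shows a generic supervised fine-tuning loop for a pre-trained latent diffusion model on a small curated set of image-caption pairs, written against the Hugging Face diffusers library. This is not Emu's actual training code: the checkpoint name, the `curated_loader`, and all hyperparameters are placeholders, and the paper's specific recipe (data curation, batch size, stopping criteria) is described in the full text, not here.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

# Hypothetical setup: any pre-trained latent diffusion checkpoint stands in
# for the model pre-trained on 1.1B image-text pairs described in the abstract.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe.to(device)

unet, vae = pipe.unet, pipe.vae
text_encoder, tokenizer = pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Only the denoising U-Net is updated; the VAE and text encoder stay frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# `curated_loader` is a placeholder yielding (images, captions) batches drawn
# from a few thousand hand-picked, highly aesthetic image-text pairs.
for images, captions in curated_loader:
    images = images.to(device)  # pixel values in [-1, 1], shape (B, 3, H, W)

    # Encode images into the latent space of the frozen VAE.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

    # Standard denoising objective: add noise at a random timestep and train
    # the U-Net to predict that noise, conditioned on the caption.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    ).long()
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    text_inputs = tokenizer(
        list(captions), padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).to(device)
    encoder_hidden_states = text_encoder(text_inputs.input_ids)[0]

    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred.float(), noise.float())

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key departure from ordinary fine-tuning is the data rather than the objective: the loader is assumed to yield only a small, carefully curated set of visually appealing pairs, which, per the abstract, is enough to shift output quality without eroding coverage of visual concepts.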