Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
September 27, 2023
Authors: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh
cs.AI
Abstract
Training text-to-image models on web-scale image-text pairs enables the
generation of a wide range of visual concepts from text. However, these
pre-trained models often struggle to generate highly aesthetic images,
which creates the need for aesthetic alignment after pre-training. In this
paper, we propose quality-tuning to effectively guide a pre-trained model
to exclusively generate highly visually appealing images, while maintaining
generality across visual concepts. Our key insight is that supervised
fine-tuning with a surprisingly small set of extremely visually appealing
images can significantly improve generation quality. We pre-train a latent
diffusion model on 1.1 billion image-text pairs and fine-tune it with only
a few thousand carefully selected high-quality images. The resulting model,
Emu, achieves a win rate of 82.9% against its pre-trained-only counterpart.
Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual
appeal 68.4% of the time on the standard PartiPrompts benchmark and 71.3%
of the time on our Open User Input benchmark, which is based on real-world
usage of text-to-image models. In addition, we show that quality-tuning is
a generic approach that is also effective for other architectures,
including pixel diffusion and masked generative transformer models.
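
To make the quality-tuning idea concrete, the sketch below shows a generic supervised fine-tuning loop for a pre-trained latent diffusion model on a small curated set of image-caption pairs, written against the Hugging Face diffusers library. This is not Emu's actual training code: the checkpoint name, the `curated_loader`, and all hyperparameters are placeholders, and the paper's specific recipe (data curation, batch size, stopping criteria) is described in the full text, not here.

```python
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

# Hypothetical setup: any pre-trained latent diffusion checkpoint stands in
# for the model pre-trained on 1.1B image-text pairs described in the abstract.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe.to(device)

unet, vae = pipe.unet, pipe.vae
text_encoder, tokenizer = pipe.text_encoder, pipe.tokenizer
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Only the denoising U-Net is updated; the VAE and text encoder stay frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

# `curated_loader` is a placeholder yielding (images, captions) batches drawn
# from a few thousand hand-picked, highly aesthetic image-text pairs.
for images, captions in curated_loader:
    images = images.to(device)  # pixel values in [-1, 1], shape (B, 3, H, W)

    # Encode images into the latent space of the frozen VAE.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

    # Standard denoising objective: add noise at a random timestep and train
    # the U-Net to predict that noise, conditioned on the caption.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    ).long()
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    text_inputs = tokenizer(
        list(captions), padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    ).to(device)
    encoder_hidden_states = text_encoder(text_inputs.input_ids)[0]

    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred.float(), noise.float())

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key departure from ordinary fine-tuning is the data rather than the objective: the loader is assumed to yield only a small, carefully curated set of visually appealing pairs, which, per the abstract, is enough to shift output quality without eroding coverage of visual concepts.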