

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

September 27, 2023
作者: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh
cs.AI

Abstract

Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
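The recipe the abstract describes is deliberately simple: take a strong pre-trained latent diffusion model and run ordinary supervised fine-tuning on a few thousand manually curated, highly aesthetic image-caption pairs, leaving the denoising objective itself unchanged. Emu's weights and training code are not public, so the sketch below only illustrates the idea using the Hugging Face diffusers API with Stable Diffusion v1.5 as a stand-in for the pre-trained model; the model choice, hyperparameters, and the `quality_tuning_step` helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of "quality-tuning": supervised fine-tuning of a
# pre-trained latent diffusion model on a tiny, hand-curated set of
# high-quality image-caption pairs. NOT Emu's released code; Stable
# Diffusion v1.5 stands in for the pre-trained model here.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed stand-in checkpoint
).to(device)
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Freeze the VAE and text encoder; only the denoising UNet is tuned.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=1e-5)

def quality_tuning_step(pixel_values, captions):
    """One fine-tuning step on curated pairs (pixel_values in [-1, 1])."""
    # Encode images into the frozen VAE's latent space.
    latents = pipe.vae.encode(pixel_values.to(device)).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # Standard denoising objective: add noise, have the UNet predict it.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Condition on the caption via the frozen text encoder.
    tokens = pipe.tokenizer(
        captions, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    text_embeds = pipe.text_encoder(tokens)[0]

    pred = pipe.unet(
        noisy_latents, timesteps, encoder_hidden_states=text_embeds
    ).sample
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point the paper stresses is that what changes is the data, not the objective: the same denoising loss, applied to a surprisingly small but rigorously filtered set of images, is enough to shift the model toward highly aesthetic outputs while preserving its coverage of visual concepts.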