

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

September 27, 2023
作者: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang, Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu, Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh
cs.AI

Abstract

Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often face challenges when it comes to generating highly aesthetic images. This creates the need for aesthetic alignment post pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images, while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a set of surprisingly small but extremely visually appealing images can significantly improve the generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% compared with its pre-trained only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred 68.4% and 71.3% of the time on visual appeal on the standard PartiPrompts and our Open User Input benchmark based on the real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
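The recipe the abstract describes is deliberately simple: take a strong pre-trained latent diffusion model and run ordinary supervised fine-tuning on a few thousand manually curated, highly aesthetic image-caption pairs, leaving the denoising objective itself unchanged. Emu's weights and training code are not public, so the sketch below only illustrates the idea using the Hugging Face diffusers API with Stable Diffusion v1.5 as a stand-in for the pre-trained model; the model choice, hyperparameters, and the `quality_tuning_step` helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of "quality-tuning": supervised fine-tuning of a
# pre-trained latent diffusion model on a tiny, hand-curated set of
# high-quality image-caption pairs. NOT Emu's released code; Stable
# Diffusion v1.5 stands in for the pre-trained model here.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline, DDPMScheduler

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # assumed stand-in checkpoint
).to(device)
noise_scheduler = DDPMScheduler.from_config(pipe.scheduler.config)

# Freeze the VAE and text encoder; only the denoising UNet is tuned.
pipe.vae.requires_grad_(False)
pipe.text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(pipe.unet.parameters(), lr=1e-5)

def quality_tuning_step(pixel_values, captions):
    """One fine-tuning step on curated pairs (pixel_values in [-1, 1])."""
    # Encode images into the frozen VAE's latent space.
    latents = pipe.vae.encode(pixel_values.to(device)).latent_dist.sample()
    latents = latents * pipe.vae.config.scaling_factor

    # Standard denoising objective: add noise, have the UNet predict it.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=device,
    )
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Condition on the caption via the frozen text encoder.
    tokens = pipe.tokenizer(
        captions, padding="max_length", truncation=True,
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    text_embeds = pipe.text_encoder(tokens)[0]

    pred = pipe.unet(
        noisy_latents, timesteps, encoder_hidden_states=text_embeds
    ).sample
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point the paper stresses is that what changes is the data, not the objective: the same denoising loss, applied to a surprisingly small but rigorously filtered set of images, is enough to shift the model toward highly aesthetic outputs while preserving its coverage of visual concepts.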