Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

Abstract

Training text-to-image models on web-scale image-text pairs enables the generation of a wide range of visual concepts from text. However, these pre-trained models often struggle to generate highly aesthetic images, which creates the need for aesthetic alignment after pre-training. In this paper, we propose quality-tuning to effectively guide a pre-trained model to exclusively generate highly visually appealing images while maintaining generality across visual concepts. Our key insight is that supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve generation quality. We pre-train a latent diffusion model on 1.1 billion image-text pairs and fine-tune it with only a few thousand carefully selected high-quality images. The resulting model, Emu, achieves a win rate of 82.9% against its pre-trained-only counterpart. Compared to the state-of-the-art SDXLv1.0, Emu is preferred on visual appeal 68.4% of the time on the standard PartiPrompts benchmark and 71.3% of the time on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic approach that is also effective for other architectures, including pixel diffusion and masked generative transformer models.
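
As an illustration of how a figure such as the 82.9% win rate could be computed from pairwise human preferences, the sketch below assumes a simple majority-vote protocol. The vote format, the tie handling, and the function name `win_rate` are assumptions made for illustration only; the abstract does not specify the evaluation protocol, and the sample votes are made up.

```python
from collections import Counter


def win_rate(votes_per_prompt):
    """Fraction of prompts on which model "A" wins a majority vote.

    votes_per_prompt: list of lists, one inner list per prompt, holding
    one annotator vote per entry: "A", "B", or "tie".
    This format is hypothetical and not taken from the paper.
    """
    if not votes_per_prompt:
        raise ValueError("need at least one prompt")
    wins = 0
    for votes in votes_per_prompt:
        counts = Counter(votes)
        # A prompt counts as a win only if strictly more annotators chose A.
        if counts["A"] > counts["B"]:
            wins += 1
    return wins / len(votes_per_prompt)


# Made-up votes for four prompts, three annotators each.
votes = [
    ["A", "A", "B"],
    ["A", "tie", "A"],
    ["B", "B", "A"],
    ["A", "B", "tie"],
]
rate = win_rate(votes)
print(f"win rate for A: {rate:.1%}")
```

In this sketch a prompt with an even split counts as neither a win nor a loss, so the win rate and the opposing model's win rate need not sum to 100%.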
