Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Abstract

In this work, we share three insights for achieving state-of-the-art aesthetic quality in text-to-image generative models. We focus on three critical aspects of model improvement: enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details. First, we delve into the significance of the noise schedule in training a diffusion model, demonstrating its profound impact on realism and visual fidelity. Second, we address the challenge of accommodating various aspect ratios in image generation, emphasizing the importance of preparing a balanced bucketed dataset. Lastly, we investigate the crucial role of aligning model outputs with human preferences, ensuring that generated images resonate with human perceptual expectations. Through extensive analysis and experiments, Playground v2.5 demonstrates state-of-the-art aesthetic quality under various conditions and aspect ratios, outperforming both widely used open-source models such as SDXL and Playground v2 and closed-source commercial systems such as DALLE 3 and Midjourney v5.2. Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models.
