Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Abstract

In this work, we share three insights for achieving state-of-the-art aesthetic quality in text-to-image generative models. We focus on three critical aspects for model improvement: enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details. First, we delve into the significance of the noise schedule in training a diffusion model, demonstrating its profound impact on realism and visual fidelity. Second, we address the challenge of accommodating various aspect ratios in image generation, emphasizing the importance of preparing a balanced bucketed dataset. Lastly, we investigate the crucial role of aligning model outputs with human preferences, ensuring that generated images resonate with human perceptual expectations. Through extensive analysis and experiments, Playground v2.5 demonstrates state-of-the-art performance in terms of aesthetic quality under various conditions and aspect ratios, outperforming both widely-used open-source models like SDXL and Playground v2, and closed-source commercial systems such as DALLE 3 and Midjourney v5.2. Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models.
