Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

Abstract

In this work, we share three insights for achieving state-of-the-art aesthetic quality in text-to-image generative models. We focus on three critical aspects of model improvement: enhancing color and contrast, improving generation across multiple aspect ratios, and improving human-centric fine details. First, we examine the significance of the noise schedule in training a diffusion model and demonstrate its profound impact on realism and visual fidelity. Second, we address the challenge of accommodating various aspect ratios in image generation, emphasizing the importance of preparing a balanced bucketed dataset. Lastly, we investigate the crucial role of aligning model outputs with human preferences, so that generated images match human perceptual expectations. Through extensive analysis and experiments, Playground v2.5 demonstrates state-of-the-art aesthetic quality under various conditions and aspect ratios, outperforming both widely used open-source models such as SDXL and Playground v2 and closed-source commercial systems such as DALLE 3 and Midjourney v5.2. Our model is open-source, and we hope the development of Playground v2.5 provides valuable guidelines for researchers aiming to elevate the aesthetic quality of diffusion-based image generation models.

