Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Abstract

We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models without the need for cascaded super-resolution components. The key insight stems from careful pre-training of the core components, namely those responsible for text-to-image alignment and those responsible for high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, which has no down-sampling encoder or up-sampling decoder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models of up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024x1024 images without cascades, is preferred by human evaluators over SDXL, 44.0% vs. 21.4%.
