STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Abstract

We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is the Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations that significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model's representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
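
To illustrate why an autoregressive normalizing flow permits exact maximum likelihood training, the following minimal sketch implements a one-dimensional affine autoregressive flow in plain Python. It is not the paper's implementation: the `conditioner` function is a hypothetical toy stand-in for a causal Transformer, and the scalar tokens, the affine form, and the standard normal prior are assumptions chosen for brevity. The sketch does not cover the deep-shallow design, the latent autoencoder, or the guidance algorithm.

```python
import math

def conditioner(prefix):
    """Hypothetical toy stand-in for a causal Transformer: x_{<t} -> (mu_t, alpha_t)."""
    if not prefix:
        return 0.0, 0.0  # the first token has no context
    s = sum(prefix)
    mu = 0.5 * s / len(prefix)
    alpha = 0.3 * math.tanh(s)
    return mu, alpha

def forward(x):
    """Map x -> z with z_t = (x_t - mu_t) * exp(-alpha_t); also return log|det J|."""
    z, logdet = [], 0.0
    for t, x_t in enumerate(x):
        mu, alpha = conditioner(x[:t])
        z.append((x_t - mu) * math.exp(-alpha))
        logdet -= alpha  # triangular Jacobian, so log|det J| = -sum(alpha_t)
    return z, logdet

def inverse(z):
    """Map z -> x one token at a time, since mu_t and alpha_t depend on x_{<t}."""
    x = []
    for z_t in z:
        mu, alpha = conditioner(x)
        x.append(z_t * math.exp(alpha) + mu)
    return x

def log_likelihood(x):
    """Exact log p(x) via change of variables, assuming a standard normal prior on z."""
    z, logdet = forward(x)
    log_prior = sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)
    return log_prior + logdet

x = [0.3, -1.2, 0.8, 2.0]
z, logdet = forward(x)
x_rec = inverse(z)  # recovers x up to floating-point error
```

Because each `z_t` depends only on `x_t` and earlier tokens, the Jacobian is triangular and its log-determinant is a simple sum. This keeps the likelihood exact and the map invertible without any discretization of the inputs.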
