STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

Abstract

We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is the Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations that significantly enhance scalability: (1) a deep-shallow design, in which a deep Transformer block captures most of the model's representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
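
To make the phrase "exact maximum likelihood training in continuous spaces" concrete, here is a minimal sketch of a generic affine autoregressive flow in NumPy. It is not the authors' implementation. It assumes a toy conditioner built from strictly lower-triangular linear maps in place of a causal Transformer, and the names `make_conditioner`, `forward`, `inverse`, and `log_likelihood` are hypothetical, chosen for illustration only. The point it illustrates is that the log-density follows exactly from the change-of-variables formula, and that the map is inverted one element at a time.

```python
import numpy as np


def make_conditioner(dim, seed=0):
    """Toy stand-in for a causal Transformer (an assumption of this sketch).

    Strictly lower-triangular weights guarantee that output t depends
    only on inputs with index < t, which is the autoregressive property.
    """
    rng = np.random.default_rng(seed)
    w_mu = np.tril(rng.normal(size=(dim, dim)), k=-1)
    w_alpha = np.tril(rng.normal(size=(dim, dim)), k=-1)
    return w_mu, w_alpha


def forward(x, w_mu, w_alpha):
    """Map data x to noise z in one parallel pass; also return log|det J|."""
    mu = w_mu @ x
    alpha = np.tanh(w_alpha @ x)  # bounded log-scale for numerical stability
    z = (x - mu) * np.exp(-alpha)
    # The Jacobian dz/dx is lower triangular with diagonal exp(-alpha).
    log_det = -np.sum(alpha)
    return z, log_det


def log_likelihood(x, w_mu, w_alpha):
    """Exact log p(x) under a standard normal prior on z."""
    z, log_det = forward(x, w_mu, w_alpha)
    log_prior = -0.5 * np.sum(z ** 2) - 0.5 * z.size * np.log(2.0 * np.pi)
    return log_prior + log_det


def inverse(z, w_mu, w_alpha):
    """Map noise z back to data x sequentially, one element at a time."""
    x = np.zeros_like(z, dtype=float)
    for t in range(z.size):
        # Row t of each weight matrix is zero at indices >= t,
        # so only the already-generated prefix x[:t] contributes.
        mu_t = w_mu[t] @ x
        alpha_t = np.tanh(w_alpha[t] @ x)
        x[t] = z[t] * np.exp(alpha_t) + mu_t
    return x


if __name__ == "__main__":
    w_mu, w_alpha = make_conditioner(dim=6)
    x = np.random.default_rng(1).normal(size=6)
    z, _ = forward(x, w_mu, w_alpha)
    print("reconstruction matches:", np.allclose(inverse(z, w_mu, w_alpha), x))
    print("log-likelihood:", log_likelihood(x, w_mu, w_alpha))
```

In this sketch the forward (training) direction is fully parallel and the likelihood needs no discretization or variational bound, while sampling requires a sequential loop. That asymmetry is typical of autoregressive flows in general.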
