

STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis

June 6, 2025
作者: Jiatao Gu, Tianrong Chen, David Berthelot, Huangjie Zheng, Yuyang Wang, Ruixiang Zhang, Laurent Dinh, Miguel Angel Bautista, Josh Susskind, Shuangfei Zhai
cs.AI

Abstract

We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of Autoregressive Transformers. We first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce several key architectural and algorithmic innovations to significantly enhance scalability: (1) a deep-shallow design, wherein a deep Transformer block captures most of the model representational capacity, complemented by a few shallow Transformer blocks that are computationally efficient yet substantially beneficial; (2) modeling in the latent space of pretrained autoencoders, which proves more effective than direct pixel-level modeling; and (3) a novel guidance algorithm that significantly boosts sample quality. Crucially, our model remains an end-to-end normalizing flow, enabling exact maximum likelihood training in continuous spaces without discretization. STARFlow achieves competitive performance in both class-conditional and text-conditional image generation tasks, approaching state-of-the-art diffusion models in sample quality. To our knowledge, this work is the first successful demonstration of normalizing flows operating effectively at this scale and resolution.
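The abstract describes TARFlow as an autoregressive flow parameterized by a causal Transformer: the density evaluation is an exact, invertible change of variables, while sampling inverts the transform token by token. The paper's actual architecture is not reproduced here; the following is a minimal pure-Python sketch of the underlying idea, an autoregressive affine flow with exact log-determinant, where the toy `mu_fn`/`alpha_fn` closures are hypothetical stand-ins for the causal Transformer's per-position outputs.

```python
import math

def ar_affine_forward(x, mu_fn, alpha_fn):
    """Map data x -> noise z with an autoregressive affine flow.

    Each step conditions only on the prefix x[:t] (causal, as a
    Transformer with causal masking would). Returns z together with
    log|det dz/dx|, which, added to the base-distribution log-density
    of z, gives the exact log-likelihood of x.
    """
    z, logdet = [], 0.0
    for t in range(len(x)):
        prefix = x[:t]
        mu, alpha = mu_fn(prefix), alpha_fn(prefix)
        z.append((x[t] - mu) * math.exp(-alpha))
        logdet += -alpha  # log-derivative of each affine step
    return z, logdet

def ar_affine_inverse(z, mu_fn, alpha_fn):
    """Sampling direction: invert sequentially, one position at a time."""
    x = []
    for t in range(len(z)):
        mu, alpha = mu_fn(x), alpha_fn(x)  # depends only on x[:t]
        x.append(z[t] * math.exp(alpha) + mu)
    return x

# Hypothetical toy conditioners standing in for the Transformer.
mu_fn = lambda p: 0.5 * (sum(p) / len(p)) if p else 0.0
alpha_fn = lambda p: 0.1 * len(p)

x = [0.3, -1.2, 0.7]
z, logdet = ar_affine_forward(x, mu_fn, alpha_fn)
x_rec = ar_affine_inverse(z, mu_fn, alpha_fn)
```

Note the asymmetry the abstract's deep-shallow design exploits: the forward (training) direction evaluates all positions against fixed prefixes, while the inverse (sampling) direction is inherently sequential, so making most flow blocks shallow keeps sampling affordable.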
