

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

March 5, 2024
Authors: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
cs.AI

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
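Concretely, the rectified flow objective described in the abstract regresses a velocity field along the straight path z_t = (1 - t)·x + t·ε between data x and noise ε, and the improved noise sampling amounts to drawing the timestep t from a logit-normal distribution rather than uniformly, which concentrates training on intermediate noise levels. The following is a minimal sketch of that training loss, assuming a hypothetical velocity-prediction network `model(z_t, t, cond)`; the `loc` and `scale` defaults are illustrative, not the paper's tuned values.

```python
import torch

def rectified_flow_loss(model, x, cond, loc=0.0, scale=1.0):
    """Sketch of a conditional flow-matching loss on the straight data-to-noise path."""
    b = x.shape[0]
    # Logit-normal timestep sampling: t = sigmoid(u), u ~ N(loc, scale).
    # This biases t toward intermediate, perceptually relevant noise levels
    # instead of sampling uniformly on [0, 1].
    t = torch.sigmoid(loc + scale * torch.randn(b, device=x.device))
    t_ = t.view(b, *([1] * (x.dim() - 1)))  # reshape for broadcasting over spatial dims
    noise = torch.randn_like(x)
    # Straight-line interpolation: z_0 is data, z_1 is pure noise.
    z_t = (1.0 - t_) * x + t_ * noise
    # The regression target is the constant velocity along that line.
    target = noise - x
    pred = model(z_t, t, cond)
    return ((pred - target) ** 2).mean()
```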
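The architecture described in the abstract can likewise be sketched compactly: each transformer block keeps separate projection weights for the text and image modalities but runs a single self-attention over the concatenated token sequence, which is what enables the bidirectional flow of information between image and text tokens. The block below is a simplified illustration under assumed names and sizes; the published architecture also includes per-modality MLPs and timestep-dependent modulation, omitted here for brevity.

```python
import torch
import torch.nn as nn

class JointBlock(nn.Module):
    """Sketch of a dual-stream block: separate weights per modality, joint attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)
        # Each modality gets its own QKV and output projections.
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)
        self.heads = heads

    def forward(self, img, txt):
        b, n_img, d = img.shape
        n_txt = txt.shape[1]
        h = self.heads
        q_i, k_i, v_i = self.qkv_img(self.norm_img(img)).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(self.norm_txt(txt)).chunk(3, dim=-1)

        def split(x):  # (b, n, d) -> (b, h, n, d // h)
            return x.view(b, -1, h, d // h).transpose(1, 2)

        # Concatenate both modalities and attend jointly: every image token
        # can read from every text token, and vice versa.
        q = split(torch.cat([q_t, q_i], dim=1))
        k = split(torch.cat([k_t, k_i], dim=1))
        v = split(torch.cat([v_t, v_i], dim=1))
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n_txt + n_img, d)
        txt_out, img_out = out[:, :n_txt], out[:, n_txt:]
        # Residual connections through modality-specific output projections.
        return img + self.proj_img(img_out), txt + self.proj_txt(txt_out)
```

Stacking such blocks and growing their depth and width is the scaling dimension along which the abstract reports predictable trends between validation loss and generation quality.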