Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
March 5, 2024
作者: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
cs.AI
Abstract
Diffusion models create data from noise by inverting the forward paths of
data towards noise and have emerged as a powerful generative modeling technique
for high-dimensional, perceptual data such as images and videos. Rectified flow
is a recent generative model formulation that connects data and noise in a
straight line. Despite its better theoretical properties and conceptual
simplicity, it is not yet decisively established as standard practice. In this
work, we improve existing noise sampling techniques for training rectified flow
models by biasing them towards perceptually relevant scales. Through a
large-scale study, we demonstrate the superior performance of this approach
compared to established diffusion formulations for high-resolution
text-to-image synthesis. Additionally, we present a novel transformer-based
architecture for text-to-image generation that uses separate weights for the
two modalities and enables a bidirectional flow of information between image
and text tokens, improving text comprehension, typography, and human preference
ratings. We demonstrate that this architecture follows predictable scaling
trends and correlates lower validation loss with improved text-to-image synthesis
as measured by various metrics and human evaluations. Our largest models
outperform state-of-the-art models, and we will make our experimental data,
code, and model weights publicly available.
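
The two training ideas summarized above, a straight-line interpolation between data and noise and a noise-level sampling distribution biased toward perceptually relevant intermediate scales, can be made concrete in a few lines. The following is a minimal PyTorch sketch, not the authors' released code; the logit-normal sampler and the names `logit_normal_timesteps`, `rectified_flow_loss`, and `model` are illustrative assumptions.

```python
# Minimal sketch of a rectified flow training step with a timestep
# distribution biased toward intermediate noise levels (an assumption
# based on the abstract, not the authors' released implementation).
import torch

def logit_normal_timesteps(batch_size, mean=0.0, std=1.0):
    # t = sigmoid(u), u ~ N(mean, std): probability mass concentrates
    # at intermediate noise scales rather than at the endpoints.
    u = torch.randn(batch_size) * std + mean
    return torch.sigmoid(u)

def rectified_flow_loss(model, x0):
    # x0: data batch of shape (B, C, H, W); model(z_t, t) predicts velocity.
    eps = torch.randn_like(x0)                  # pure-noise endpoint
    t = logit_normal_timesteps(x0.shape[0]).view(-1, 1, 1, 1)
    z_t = (1.0 - t) * x0 + t * eps              # straight line from data to noise
    target = eps - x0                           # constant velocity of that line
    pred = model(z_t, t.flatten())
    return torch.mean((pred - target) ** 2)
```

Regressing the constant velocity of the data-to-noise line is what gives rectified flow its conceptual simplicity: sampling reduces to integrating the learned velocity field along (ideally) straight paths.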
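
The architecture described above uses separate weights for the text and image modalities while letting information flow in both directions between their tokens. One plausible reading of that design is joint self-attention over the concatenated token sequences with modality-specific projections, sketched below; the class and parameter names are assumptions for illustration, not the released architecture.

```python
# Sketch of joint attention with per-modality weights (illustrative,
# inferred from the abstract rather than taken from released code).
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        # Separate projections keep dedicated weights per modality.
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)

    def forward(self, img, txt):
        B, n_img, D = img.shape
        n_txt = txt.shape[1]
        h, d = self.num_heads, D // self.num_heads

        def heads(qkv, n):
            # (B, n, 3D) -> three tensors of shape (B, h, n, d)
            return [x.view(B, n, h, d).transpose(1, 2)
                    for x in qkv.chunk(3, dim=-1)]

        qi, ki, vi = heads(self.qkv_img(img), n_img)
        qt, kt, vt = heads(self.qkv_txt(txt), n_txt)

        # One attention over the concatenated sequence: every image token
        # can attend to every text token and vice versa.
        q = torch.cat([qi, qt], dim=2)
        k = torch.cat([ki, kt], dim=2)
        v = torch.cat([vi, vt], dim=2)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, n_img + n_txt, D)

        # Route the joint result back through modality-specific outputs.
        return self.out_img(out[:, :n_img]), self.out_txt(out[:, n_img:])
```

Concatenating the two sequences lets a single attention operation carry information in both directions, while the per-modality projections keep the token types on separate weight paths, matching the bidirectional flow the abstract describes.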