Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
March 5, 2024
作者: Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach
cs.AI
Abstract
Diffusion models create data from noise by inverting the forward paths of
data towards noise and have emerged as a powerful generative modeling technique
for high-dimensional, perceptual data such as images and videos. Rectified flow
is a recent generative model formulation that connects data and noise in a
straight line. Despite its better theoretical properties and conceptual
simplicity, it is not yet decisively established as standard practice. In this
work, we improve existing noise sampling techniques for training rectified flow
models by biasing them towards perceptually relevant scales. Through a
large-scale study, we demonstrate the superior performance of this approach
compared to established diffusion formulations for high-resolution
text-to-image synthesis. Additionally, we present a novel transformer-based
architecture for text-to-image generation that uses separate weights for the
two modalities and enables a bidirectional flow of information between image
and text tokens, improving text comprehension, typography, and human preference
ratings. We demonstrate that this architecture follows predictable scaling
trends and correlates lower validation loss with improved text-to-image synthesis
as measured by various metrics and human evaluations. Our largest models
outperform state-of-the-art models, and we will make our experimental data,
code, and model weights publicly available.
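
The two training ideas summarized above, a straight-line interpolation between data and noise and a noise-level sampling distribution biased toward perceptually relevant intermediate scales, can be made concrete in a few lines. The following is a minimal PyTorch sketch, not the authors' released code; the logit-normal sampler and the names `logit_normal_timesteps`, `rectified_flow_loss`, and `model` are illustrative assumptions.

```python
# Minimal sketch of a rectified flow training step with a timestep
# distribution biased toward intermediate noise levels (an assumption
# based on the abstract, not the authors' released implementation).
import torch

def logit_normal_timesteps(batch_size, mean=0.0, std=1.0):
    # t = sigmoid(u), u ~ N(mean, std): probability mass concentrates
    # at intermediate noise scales rather than at the endpoints.
    u = torch.randn(batch_size) * std + mean
    return torch.sigmoid(u)

def rectified_flow_loss(model, x0):
    # x0: data batch of shape (B, C, H, W); model(z_t, t) predicts velocity.
    eps = torch.randn_like(x0)                  # pure-noise endpoint
    t = logit_normal_timesteps(x0.shape[0]).view(-1, 1, 1, 1)
    z_t = (1.0 - t) * x0 + t * eps              # straight line from data to noise
    target = eps - x0                           # constant velocity of that line
    pred = model(z_t, t.flatten())
    return torch.mean((pred - target) ** 2)
```

Regressing the constant velocity of the data-to-noise line is what gives rectified flow its conceptual simplicity: sampling reduces to integrating the learned velocity field along (ideally) straight paths.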
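
The architecture described above uses separate weights for the text and image modalities while letting information flow in both directions between their tokens. One plausible reading of that design is joint self-attention over the concatenated token sequences with modality-specific projections, sketched below; the class and parameter names are assumptions for illustration, not the released architecture.

```python
# Sketch of joint attention with per-modality weights (illustrative,
# inferred from the abstract rather than taken from released code).
import torch
import torch.nn as nn

class JointAttention(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        # Separate projections keep dedicated weights per modality.
        self.qkv_img = nn.Linear(dim, 3 * dim)
        self.qkv_txt = nn.Linear(dim, 3 * dim)
        self.out_img = nn.Linear(dim, dim)
        self.out_txt = nn.Linear(dim, dim)

    def forward(self, img, txt):
        B, n_img, D = img.shape
        n_txt = txt.shape[1]
        h, d = self.num_heads, D // self.num_heads

        def heads(qkv, n):
            # (B, n, 3D) -> three tensors of shape (B, h, n, d)
            return [x.view(B, n, h, d).transpose(1, 2)
                    for x in qkv.chunk(3, dim=-1)]

        qi, ki, vi = heads(self.qkv_img(img), n_img)
        qt, kt, vt = heads(self.qkv_txt(txt), n_txt)

        # One attention over the concatenated sequence: every image token
        # can attend to every text token and vice versa.
        q = torch.cat([qi, qt], dim=2)
        k = torch.cat([ki, kt], dim=2)
        v = torch.cat([vi, vt], dim=2)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, n_img + n_txt, D)

        # Route the joint result back through modality-specific outputs.
        return self.out_img(out[:, :n_img]), self.out_txt(out[:, n_img:])
```

Concatenating the two sequences lets a single attention operation carry information in both directions, while the per-modality projections keep the token types on separate weight paths, matching the bidirectional flow the abstract describes.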