Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise, and they have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and that lower validation loss correlates with improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
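
The straight-line connection between data and noise, together with a non-uniform choice of noise level during training, can be illustrated with a minimal sketch. It is not the authors' implementation. The orientation of the path (data at t = 0, noise at t = 1), the function names, and the logistic-of-a-normal timestep sampler are all assumptions made for illustration; the abstract does not say which sampler is used.

```python
import math
import random


def interpolate(x0, noise, t):
    """Point on the straight line from data x0 (t = 0) to noise (t = 1).

    Assumed convention: z_t = (1 - t) * x0 + t * noise.
    """
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Constant velocity of the straight path, d z_t / d t = noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def sample_timestep(rng, mean=0.0, std=1.0):
    """Hypothetical non-uniform sampler: logistic of a normal draw.

    It places more probability mass on intermediate t than a uniform
    sampler does. This is an illustrative choice, not taken from the text.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                       # toy "data" vector
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # standard normal noise
t = sample_timestep(rng)                    # biased noise level in (0, 1)
z_t = interpolate(x0, noise, t)             # noisy training input
target = velocity_target(x0, noise)         # regression target for a velocity model
```

In a training loop under these assumptions, a network would receive `z_t` and `t` and be regressed onto `target`. Changing `mean` and `std` shifts which noise levels are visited most often.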
