FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model

Abstract

Nature is infinitely resolution-free. Against this reality, existing diffusion models, such as Diffusion Transformers, often struggle when processing image resolutions outside their trained domain. To address this limitation, we conceptualize images as sequences of tokens with dynamic sizes, rather than as fixed-resolution grids, as traditional methods do. This perspective enables a flexible training strategy that seamlessly accommodates various aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases introduced by image cropping. On this basis, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. We further upgrade FiT to FiTv2 with several innovative designs, including Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted network structure, FiTv2 converges 2× faster than FiT. When combined with advanced training-free extrapolation techniques, FiTv2 demonstrates remarkable adaptability in both resolution extrapolation and diverse resolution generation. Additionally, our exploration of the scalability of the FiTv2 model reveals that larger models exhibit better computational efficiency. Furthermore, we introduce an efficient post-training strategy to adapt a pre-trained model for high-resolution generation. Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions. We have released all code and models at https://github.com/whlzy/FiT to promote the exploration of diffusion transformer models for arbitrary-resolution image generation.
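To make the "sequence of tokens with dynamic sizes" idea concrete, here is a minimal sketch in plain Python. It is not taken from the FiT code base: the patch size, the single-channel image, the zero padding, the mask layout, and the helper names `patchify`, `pad_tokens`, and `make_image` are all assumptions chosen for illustration. It only shows that images with different aspect ratios can be turned into token sequences without cropping, and that shorter sequences can be padded to a common length with a mask.

```python
from typing import List, Tuple

Image = List[List[float]]   # H x W, single channel for brevity (assumption)
Token = List[float]         # one flattened patch


def patchify(image: Image, patch: int) -> List[Token]:
    """Split an H x W image into non-overlapping patch x patch tokens, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens


def pad_tokens(tokens: List[Token], max_len: int) -> Tuple[List[Token], List[int]]:
    """Pad a token sequence with zero tokens up to max_len.

    The returned mask holds 1 for real tokens and 0 for padding
    (hypothetical layout, chosen for this sketch).
    """
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len")
    dim = len(tokens[0])
    padding = [[0.0] * dim for _ in range(max_len - len(tokens))]
    mask = [1] * len(tokens) + [0] * len(padding)
    return tokens + padding, mask


def make_image(height: int, width: int) -> Image:
    """Toy image whose pixel value is its row-major index."""
    return [[float(r * width + c) for c in range(width)] for r in range(height)]


wide = patchify(make_image(4, 8), patch=2)    # 2 x 4 grid of patches
tall = patchify(make_image(8, 4), patch=2)    # 4 x 2 grid of patches
small = patchify(make_image(4, 4), patch=2)   # 2 x 2 grid of patches
small_padded, small_mask = pad_tokens(small, max_len=8)

print(len(wide), len(tall), len(small), small_mask)
```

In this toy setup the wide and the tall image both yield eight tokens, so they fit the same sequence budget despite their different aspect ratios, while the smaller image is padded and its padding positions are marked in the mask so that a model could ignore them.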
