FiT: Flexible Vision Transformer for Diffusion Model

Abstract

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.
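To make the idea of treating an image as a sequence of dynamically sized tokens more concrete, here is a minimal sketch, not the authors' implementation. It shows how inputs with different aspect ratios could map to token sequences of different lengths that are then padded to a shared budget. The function name `token_layout`, the patch size of 2, the budget of 256 tokens, and the use of a padding mask are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
from dataclasses import dataclass


@dataclass
class TokenLayout:
    grid_h: int      # number of patches along the height
    grid_w: int      # number of patches along the width
    num_tokens: int  # real (non-padding) tokens
    mask: list       # 1 for real tokens, 0 for padding


def token_layout(height, width, patch_size=2, max_tokens=256):
    """Describe an input of any aspect ratio as a padded token sequence.

    Hypothetical helper: patch_size and max_tokens are illustrative
    assumptions, not values taken from the FiT abstract.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    grid_h, grid_w = height // patch_size, width // patch_size
    num_tokens = grid_h * grid_w
    if num_tokens > max_tokens:
        raise ValueError("input yields more tokens than max_tokens")
    # Real tokens come first; the remainder of the budget is padding.
    mask = [1] * num_tokens + [0] * (max_tokens - num_tokens)
    return TokenLayout(grid_h, grid_w, num_tokens, mask)


# Inputs with different aspect ratios fit the same token budget without cropping.
for h, w in [(32, 32), (16, 64), (48, 16)]:
    layout = token_layout(h, w)
    print((h, w), layout.grid_h, layout.grid_w, layout.num_tokens)
```

In this sketch a square input and a 1:4 input both fill the budget exactly, while a 3:1 input uses fewer tokens and relies on the mask to mark the padding, so no input has to be cropped to a fixed grid.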
