FiT: Flexible Vision Transformer for Diffusion Model
February 19, 2024
Authors: Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai
cs.AI
Abstract
Nature is infinitely resolution-free. In the context of this reality,
existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside of their trained domain.
To overcome this limitation, we present the Flexible Vision Transformer (FiT),
a transformer architecture specifically designed for generating images with
unrestricted resolutions and aspect ratios. Unlike traditional methods that
perceive images as static-resolution grids, FiT conceptualizes images as
sequences of dynamically-sized tokens. This perspective enables a flexible
training strategy that effortlessly adapts to diverse aspect ratios during both
training and inference phases, thus promoting resolution generalization and
eliminating biases induced by image cropping. Enhanced by a meticulously
adjusted network structure and the integration of training-free extrapolation
techniques, FiT exhibits remarkable flexibility in resolution extrapolation
generation. Comprehensive experiments demonstrate the exceptional performance
of FiT across a broad range of resolutions, showcasing its effectiveness both
within and beyond its training resolution distribution. Repository available at
https://github.com/whlzy/FiT.
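The core idea of conceptualizing images as "sequences of dynamically-sized tokens" can be illustrated with a minimal sketch: an image of arbitrary resolution is patchified into a variable-length token sequence, then padded with an attention mask to a fixed maximum length so that batches mix aspect ratios freely. This is an illustrative assumption of how such a pipeline might look, not FiT's actual implementation; the function name, patch size, and padding scheme here are hypothetical.

```python
import numpy as np

def image_to_token_sequence(img, patch=16, max_len=256):
    """Patchify an (H, W, C) image into a variable-length token
    sequence, then pad to max_len with a validity mask.
    Hypothetical sketch of dynamic-length tokenization, not FiT's code."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "resolution must be patch-aligned"
    nh, nw = h // patch, w // patch
    # Split into non-overlapping patches and flatten each to one token:
    # (nh, patch, nw, patch, c) -> (nh, nw, patch, patch, c) -> (nh*nw, patch*patch*c)
    tokens = (img.reshape(nh, patch, nw, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(nh * nw, patch * patch * c))
    n = tokens.shape[0]
    assert n <= max_len, "sequence exceeds the training budget"
    padded = np.zeros((max_len, tokens.shape[1]), dtype=img.dtype)
    padded[:n] = tokens
    mask = np.zeros(max_len, dtype=bool)  # True marks real (unpadded) tokens
    mask[:n] = True
    return padded, mask

# Two different aspect ratios map to the same padded sequence shape,
# differing only in how many mask entries are valid.
a, ma = image_to_token_sequence(np.ones((256, 128, 3), np.float32))  # 16x8 = 128 tokens
b, mb = image_to_token_sequence(np.ones((160, 160, 3), np.float32))  # 10x10 = 100 tokens
```

Because padding is masked out in attention, no cropping or resizing to a fixed grid is needed, which is the bias the abstract says this formulation removes.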