FiT: Flexible Vision Transformer for Diffusion Model
February 19, 2024
Authors: Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai
cs.AI
Abstract
Nature is infinitely resolution-free. In the context of this reality,
existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside of their trained domain.
To overcome this limitation, we present the Flexible Vision Transformer (FiT),
a transformer architecture specifically designed for generating images with
unrestricted resolutions and aspect ratios. Unlike traditional methods that
perceive images as static-resolution grids, FiT conceptualizes images as
sequences of dynamically-sized tokens. This perspective enables a flexible
training strategy that effortlessly adapts to diverse aspect ratios during both
training and inference phases, thus promoting resolution generalization and
eliminating biases induced by image cropping. Enhanced by a meticulously
adjusted network structure and the integration of training-free extrapolation
techniques, FiT exhibits remarkable flexibility in resolution extrapolation
generation. Comprehensive experiments demonstrate the exceptional performance
of FiT across a broad range of resolutions, showcasing its effectiveness both
within and beyond its training resolution distribution. Repository available at
https://github.com/whlzy/FiT.
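The core idea of conceptualizing images as "sequences of dynamically-sized tokens" can be illustrated with a minimal sketch: an image of arbitrary resolution is patchified into a variable-length token sequence, then padded with an attention mask to a fixed maximum length so that batches mix aspect ratios freely. This is an illustrative assumption of how such a pipeline might look, not FiT's actual implementation; the function name, patch size, and padding scheme here are hypothetical.

```python
import numpy as np

def image_to_token_sequence(img, patch=16, max_len=256):
    """Patchify an (H, W, C) image into a variable-length token
    sequence, then pad to max_len with a validity mask.
    Hypothetical sketch of dynamic-length tokenization, not FiT's code."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "resolution must be patch-aligned"
    nh, nw = h // patch, w // patch
    # Split into non-overlapping patches and flatten each to one token:
    # (nh, patch, nw, patch, c) -> (nh, nw, patch, patch, c) -> (nh*nw, patch*patch*c)
    tokens = (img.reshape(nh, patch, nw, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(nh * nw, patch * patch * c))
    n = tokens.shape[0]
    assert n <= max_len, "sequence exceeds the training budget"
    padded = np.zeros((max_len, tokens.shape[1]), dtype=img.dtype)
    padded[:n] = tokens
    mask = np.zeros(max_len, dtype=bool)  # True marks real (unpadded) tokens
    mask[:n] = True
    return padded, mask

# Two different aspect ratios map to the same padded sequence shape,
# differing only in how many mask entries are valid.
a, ma = image_to_token_sequence(np.ones((256, 128, 3), np.float32))  # 16x8 = 128 tokens
b, mb = image_to_token_sequence(np.ones((160, 160, 3), np.float32))  # 10x10 = 100 tokens
```

Because padding is masked out in attention, no cropping or resizing to a fixed grid is needed, which is the bias the abstract says this formulation removes.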