
FiT: Flexible Vision Transformer for Diffusion Model

February 19, 2024
Authors: Zeyu Lu, Zidong Wang, Di Huang, Chengyue Wu, Xihui Liu, Wanli Ouyang, Lei Bai
cs.AI

Abstract

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically-sized tokens. This perspective enables a flexible training strategy that effortlessly adapts to diverse aspect ratios during both training and inference phases, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.
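To make the "sequences of dynamically-sized tokens" idea concrete, the following is a minimal PyTorch sketch: images of different resolutions and aspect ratios are patchified into variable-length token sequences and padded to a common length with an attention mask so they can share one batch. This is an assumption-based illustration, not FiT's actual implementation; names such as `patchify_to_tokens`, `pad_and_mask`, and `max_tokens` are hypothetical.

```python
import torch

def patchify_to_tokens(image: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into a (N, C * patch_size**2) token sequence,
    where N = (H // patch_size) * (W // patch_size) varies with resolution."""
    c, h, w = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # (C, H/p, W/p, p, p) -> (H/p, W/p, C, p, p) -> (N, C * p * p)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

def pad_and_mask(tokens: torch.Tensor, max_tokens: int):
    """Pad a variable-length token sequence to `max_tokens` and return a
    boolean mask marking the real (non-padding) positions."""
    n, d = tokens.shape
    padded = torch.zeros(max_tokens, d, dtype=tokens.dtype)
    padded[:n] = tokens
    mask = torch.zeros(max_tokens, dtype=torch.bool)
    mask[:n] = True
    return padded, mask

# Two images with different resolutions yield different token counts,
# yet fit into one padded batch with a shared attention mask.
img_a = torch.randn(3, 256, 160)   # 16 x 10 = 160 patches
img_b = torch.randn(3, 224, 224)   # 14 x 14 = 196 patches
seqs = [patchify_to_tokens(img) for img in (img_a, img_b)]
padded, masks = zip(*(pad_and_mask(s, max_tokens=256) for s in seqs))
batch = torch.stack(padded)        # (2, 256, 768)
attn_mask = torch.stack(masks)     # (2, 256), True at real token positions
```

The mask is what lets a transformer attend only to real tokens, so the same model can be trained and sampled at arbitrary resolutions and aspect ratios without cropping images to a fixed grid.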
