FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model
October 17, 2024
Authors: ZiDong Wang, Zeyu Lu, Di Huang, Cai Zhou, Wanli Ouyang, and Lei Bai
cs.AI
Abstract
Nature is infinitely resolution-free. In the context of this
reality, existing diffusion models, such as Diffusion Transformers, often face
challenges when processing image resolutions outside their training domain.
To address this limitation, we conceptualize images as sequences of tokens with
dynamic sizes, in contrast to traditional methods that treat images as
fixed-resolution grids. This perspective enables a flexible training strategy
that seamlessly accommodates various aspect ratios during both training and
inference, thus promoting resolution generalization and eliminating biases
introduced by image cropping. On this basis, we present the Flexible
Vision Transformer (FiT), a transformer architecture specifically designed for
generating images with unrestricted resolutions and aspect ratios. We
further upgrade FiT to FiTv2 with several innovative designs, including
Query-Key vector normalization, the AdaLN-LoRA module, a rectified flow
scheduler, and a Logit-Normal sampler. Enhanced by a meticulously adjusted
network structure, FiTv2 exhibits 2× the convergence speed of FiT. When
incorporating advanced training-free extrapolation techniques, FiTv2
demonstrates remarkable adaptability in both resolution extrapolation and
diverse resolution generation. Additionally, our exploration of the scalability
of the FiTv2 model reveals that larger models exhibit better computational
efficiency. Furthermore, we introduce an efficient post-training strategy to
adapt a pre-trained model for high-resolution generation. Comprehensive
experiments demonstrate the exceptional performance of FiTv2 across a broad
range of resolutions. We have released all code and models at
https://github.com/whlzy/FiT to promote the exploration of diffusion
transformer models for arbitrary-resolution image generation.