FiT: Flexible Vision Transformer for Diffusion Model

Abstract

Nature is infinitely resolution-free. In the context of this reality, existing diffusion models, such as Diffusion Transformers, often face challenges when processing image resolutions outside of their trained domain. To overcome this limitation, we present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios. Unlike traditional methods that perceive images as static-resolution grids, FiT conceptualizes images as sequences of dynamically sized tokens. This perspective enables a flexible training strategy that adapts to diverse aspect ratios during both training and inference, thus promoting resolution generalization and eliminating biases induced by image cropping. Enhanced by a meticulously adjusted network structure and the integration of training-free extrapolation techniques, FiT exhibits remarkable flexibility in resolution extrapolation generation. Comprehensive experiments demonstrate the exceptional performance of FiT across a broad range of resolutions, showcasing its effectiveness both within and beyond its training resolution distribution. Repository available at https://github.com/whlzy/FiT.
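The idea of treating an image as a variable-length token sequence, rather than a fixed grid, can be illustrated with a small sketch. This is not the FiT implementation: the function name `to_token_sequence`, the patch size of 2, and the cap of 256 tokens are all assumptions chosen for illustration. The sketch only shows how inputs of different aspect ratios could map to sequences of different lengths, padded to a common length with a validity mask.

```python
from typing import List, Tuple


def to_token_sequence(
    height: int, width: int, patch: int, max_tokens: int
) -> Tuple[List[Tuple[int, int]], List[bool]]:
    """Map an input of size (height, width) to a padded token sequence.

    Hypothetical helper, not taken from the FiT repository.
    Returns the (row, col) grid position of every token slot and a mask
    that is True for real tokens and False for padding slots.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    rows, cols = height // patch, width // patch
    n_tokens = rows * cols
    if n_tokens > max_tokens:
        raise ValueError("input yields more tokens than max_tokens")

    # Row-major flattening of the patch grid; the grid shape follows the input.
    positions = [(r, c) for r in range(rows) for c in range(cols)]

    # Pad to a common length so inputs of different shapes can share a batch.
    n_pad = max_tokens - n_tokens
    positions += [(0, 0)] * n_pad
    mask = [True] * n_tokens + [False] * n_pad
    return positions, mask


if __name__ == "__main__":
    # Three aspect ratios, no cropping or resizing to a square.
    for h, w in [(32, 32), (16, 64), (48, 16)]:
        _, mask = to_token_sequence(h, w, patch=2, max_tokens=256)
        print(f"{h}x{w}: {sum(mask)} real tokens, {mask.count(False)} padding")
```

In a sketch like this, the mask would typically be used to exclude padding slots from attention and from the loss, and the per-token (row, col) positions would feed a 2D positional encoding, so the original aspect ratio is preserved without cropping.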
