GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

Abstract

In this study, we explore Transformer-based diffusion models for image and video generation. Although Transformer architectures dominate many fields thanks to their flexibility and scalability, visual generation, and diffusion-based models in particular, still relies primarily on CNN-based U-Net architectures. To address this gap, we introduce GenTron, a family of generative models employing Transformer-based diffusion. As a first step, we adapt Diffusion Transformers (DiTs) from class conditioning to text conditioning, a process that involves a thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters and observe significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating a novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also performs strongly on T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.
