GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

December 7, 2023
Authors: Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
cs.AI

Abstract

In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.
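The abstract does not spell out which conditioning mechanism the empirical exploration settled on. A common way to turn a class-conditioned DiT into a text-conditioned one is to add cross-attention from the latent patch tokens to text-encoder embeddings inside each Transformer block. The PyTorch sketch below illustrates that general pattern only; the block layout and the names (`TextConditionedDiTBlock`, `txt_dim`) are illustrative assumptions, not GenTron's actual architecture.

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Hypothetical DiT-style block: self-attention over image tokens,
    cross-attention to text-encoder embeddings (e.g. CLIP/T5 outputs),
    then an MLP. A sketch of the general pattern, not GenTron itself."""

    def __init__(self, dim: int, n_heads: int, txt_dim: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(
            dim, n_heads, kdim=txt_dim, vdim=txt_dim, batch_first=True
        )
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # x: (B, N_img, dim) noisy latent patch tokens
        # txt: (B, N_txt, txt_dim) text-encoder embeddings
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, txt, txt, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))
```

Stacking such blocks and widening `dim` is also one natural route to the parameter scaling the abstract describes (roughly 900M to over 3B parameters), though the exact scaled configurations are given only in the paper.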
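The abstract names motion-free guidance only at a high level. One plausible reading, by analogy with classifier-free guidance, is an extra guidance term computed from a forward pass in which the model's temporal (cross-frame) pathway is disabled. Everything in this sketch is an assumption for illustration: the `motion_free` flag, the guidance weights, and the combination rule are hypothetical and may differ from the paper's formulation.

```python
import torch

@torch.no_grad()
def guided_eps(model, x_t, t, text_emb, null_emb,
               w_text=7.5, w_motion=1.2):
    """CFG-style noise prediction with a hypothetical motion-free branch.

    `model(x, t, cond, motion_free=...)` is an assumed interface in which
    motion_free=True bypasses the temporal attention layers, so the model
    predicts each frame as a (near-)static image.
    x_t: (B, C, F, H, W) noisy video latents at diffusion step t.
    """
    eps_uncond = model(x_t, t, null_emb, motion_free=False)    # no text
    eps_text = model(x_t, t, text_emb, motion_free=False)      # full model
    eps_no_motion = model(x_t, t, text_emb, motion_free=True)  # motion off
    # Standard text guidance, plus an assumed extra term that pushes the
    # prediction away from the motion-free (static) output.
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_motion * (eps_text - eps_no_motion))
```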