GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

December 7, 2023
Authors: Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
cs.AI

Abstract

In this study, we explore Transformer-based diffusion models for image and video generation. Although Transformer architectures dominate many fields thanks to their flexibility and scalability, visual generation still relies primarily on CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our first step is to adapt Diffusion Transformers (DiTs) from class conditioning to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters and observe significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating a novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels on T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.
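
The abstract gives win and draw rates for the human evaluation but not the share of comparisons in which SDXL was preferred. A minimal sketch of that arithmetic follows, assuming every comparison ends in exactly one of win, draw, or loss, so that the three rates sum to 100% (the abstract does not state this). The function name `implied_loss_rate` is hypothetical and used only for illustration.

```python
def implied_loss_rate(win_pct: float, draw_pct: float) -> float:
    """Return the remaining share, ASSUMING win + draw + loss = 100%."""
    # Round to one decimal place to match the precision of the quoted figures.
    return round(100.0 - win_pct - draw_pct, 1)


# (win %, draw %) as quoted in the abstract: GenTron vs. SDXL, human evaluation.
reported = {
    "visual quality": (51.1, 19.8),
    "text alignment": (42.3, 42.9),
}

for criterion, (win, draw) in reported.items():
    loss = implied_loss_rate(win, draw)
    print(f"{criterion}: SDXL preferred in {loss}% of comparisons")
```

Under that assumption, SDXL would be preferred in 29.1% of visual-quality comparisons and 14.8% of text-alignment comparisons. On text alignment, the draw rate (42.9%) is slightly higher than GenTron's win rate (42.3%).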