GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

Abstract

In this study, we explore Transformer-based diffusion models for image and video generation. Although Transformer architectures dominate many fields thanks to their flexibility and scalability, the visual generative domain still relies primarily on CNN-based U-Net architectures, particularly in diffusion-based models. To address this gap, we introduce GenTron, a family of generative models employing Transformer-based diffusion. As a first step, we adapt Diffusion Transformers (DiTs) from class conditioning to text conditioning, a process that involves a thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters and observe significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating a novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels on T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.
