T2V-Turbo-v2: Enhancing Video Generation Model Post-Training through Data, Reward, and Conditional Guidance Design

Abstract

In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, integrates several supervision signals into the consistency distillation process: high-quality training data, reward model feedback, and conditional guidance. Through comprehensive ablation studies, we show the importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for improving both visual quality and text-video alignment. We also highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, which improves the motion quality of the generated videos as reflected in the motion-related metrics of VBench and T2V-CompBench. Empirically, T2V-Turbo-v2 establishes a new state-of-the-art result on VBench with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.

