T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

Abstract

Diffusion-based text-to-video (T2V) models have achieved significant success, but their iterative sampling process remains slow. Consistency models have been proposed to enable fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve video generation that is both fast and high quality. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with the single-step generations that arise naturally from computing the CD loss, which bypasses the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, surpassing even Gen-2 and Pika. We further conduct human evaluations to corroborate these results, confirming that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models. This represents more than a tenfold acceleration while also improving video generation quality.
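As a rough illustration of the idea described above, the following minimal sketch combines a consistency distillation loss with feedback from several reward models evaluated on a single-step generation. It is a toy under stated assumptions, not the authors' implementation: the names `mixed_reward_loss` and `step_ratio`, the reward weights, and the use of plain floats in place of tensors are all hypothetical. In real training the rewards would be differentiable functions of the model output.

```python
from typing import Callable, Sequence

# Toy stand-in for a single-step video generation (hypothetical type).
Sample = Sequence[float]
RewardModel = Callable[[Sample], float]


def mixed_reward_loss(
    cd_loss: float,
    single_step_sample: Sample,
    reward_models: Sequence[RewardModel],
    weights: Sequence[float],
) -> float:
    """Return the CD loss minus a weighted sum of rewards.

    Assumption: rewards are maximized, so they enter the loss with a
    negative sign. The weights are hypothetical hyperparameters.
    """
    if len(reward_models) != len(weights):
        raise ValueError("need exactly one weight per reward model")
    # Each reward is scored on the single-step sample that the CD loss
    # already produces, so no multi-step sampling chain is unrolled.
    reward_term = sum(
        w * reward(single_step_sample)
        for reward, w in zip(reward_models, weights)
    )
    return cd_loss - reward_term


def step_ratio(teacher_steps: int, student_steps: int) -> float:
    """Ratio of sampling steps, a proxy for speedup that ignores per-step cost."""
    return teacher_steps / student_steps
```

With these definitions, `step_ratio(50, 4)` evaluates to 12.5, which is consistent with the "more than a tenfold acceleration" stated above when only the number of sampling steps is counted.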
