

T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback

May 29, 2024
作者: Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
cs.AI

Abstract

Diffusion-based text-to-video (T2V) models have achieved significant success but continue to be hampered by the slow sampling speed of their iterative sampling processes. To address the challenge, consistency models have been proposed to facilitate fast inference, albeit at the cost of sample quality. In this work, we aim to break the quality bottleneck of a video consistency model (VCM) to achieve both fast and high-quality video generation. We introduce T2V-Turbo, which integrates feedback from a mixture of differentiable reward models into the consistency distillation (CD) process of a pre-trained T2V model. Notably, we directly optimize rewards associated with single-step generations that arise naturally from computing the CD loss, effectively bypassing the memory constraints imposed by backpropagating gradients through an iterative sampling process. Remarkably, the 4-step generations from our T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and Pika. We further conduct human evaluations to corroborate the results, validating that the 4-step generations from our T2V-Turbo are preferred over the 50-step DDIM samples from their teacher models, representing more than a tenfold acceleration while improving video generation quality.
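The objective described above can be sketched as a single combined loss: the consistency-distillation (CD) loss plus reward terms evaluated directly on the single-step generation that the CD computation already produces. The following is a minimal illustrative sketch only; the stand-in reward functions, the toy data, and the weights `beta_img` / `beta_vid` are hypothetical placeholders, not the paper's actual models or hyperparameters.

```python
import random

random.seed(0)

# Toy stand-ins: in the paper these would be video latents from the
# student's single-step generation and the teacher-guided CD target.
x = [random.gauss(0, 1) for _ in range(8)]         # single-step generation
x_target = [random.gauss(0, 1) for _ in range(8)]  # consistency target

def reward_image(v):
    # Stand-in for an image-text reward model score (higher is better).
    return -sum(t * t for t in v) / len(v)

def reward_video(v):
    # Stand-in for a video-text reward model score (higher is better).
    return -sum(abs(t) for t in v) / len(v)

# Consistency-distillation loss: match the student's single-step output
# to the teacher-guided target (mean squared error here for simplicity).
cd_loss = sum((a - b) ** 2 for a, b in zip(x, x_target)) / len(x)

# Mixed reward feedback on the same single-step generation, with
# hypothetical weights. Maximizing reward = subtracting it from the loss,
# so no backpropagation through an iterative sampling chain is needed.
beta_img, beta_vid = 0.5, 0.2
total_loss = cd_loss - beta_img * reward_image(x) - beta_vid * reward_video(x)
```

In a real training loop, `total_loss` would be backpropagated through the student model only once per step, which is what lets the method sidestep the memory cost of differentiating through many sampling iterations.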

