T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback
May 29, 2024
作者: Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, William Yang Wang
cs.AI
Abstract
Diffusion-based text-to-video (T2V) models have achieved significant success
but continue to be hampered by the slow sampling speed of their iterative
sampling processes. To address the challenge, consistency models have been
proposed to facilitate fast inference, albeit at the cost of sample quality. In
this work, we aim to break the quality bottleneck of a video consistency model
(VCM) to achieve both fast and high-quality video generation. We
introduce T2V-Turbo, which integrates feedback from a mixture of differentiable
reward models into the consistency distillation (CD) process of a pre-trained
T2V model. Notably, we directly optimize rewards associated with single-step
generations that arise naturally from computing the CD loss, effectively
bypassing the memory constraints imposed by backpropagating gradients through
an iterative sampling process. Remarkably, the 4-step generations from our
T2V-Turbo achieve the highest total score on VBench, even surpassing Gen-2 and
Pika. We further conduct human evaluations to corroborate the results,
validating that the 4-step generations from our T2V-Turbo are preferred over
the 50-step DDIM samples from their teacher models, representing more than a
tenfold acceleration while improving video generation quality.
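The abstract's key idea is that the single-step student prediction computed for the consistency distillation (CD) loss can be reused to evaluate differentiable reward models, so reward gradients never have to flow through an iterative sampler. A minimal sketch of that combined objective is below; the function name, the MSE form of the CD term, and the `reward_weight` scaling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def t2v_turbo_loss(student_pred, cd_target, reward_fn, reward_weight=1.0):
    """Hypothetical sketch: CD loss plus a reward term evaluated on the
    same single-step student prediction (names/weights are assumptions)."""
    # CD term: match the student's one-step output to the distillation target.
    cd_loss = np.mean((student_pred - cd_target) ** 2)
    # Reward term: maximize reward on the single-step generation that already
    # exists from the CD computation -- no backprop through iterative sampling.
    reward = reward_fn(student_pred)
    return cd_loss - reward_weight * reward

# Toy usage: dummy "video" latents and a reward that prefers low-energy outputs.
rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 8))    # single-step student output
target = rng.normal(size=(4, 8))  # distillation target from the teacher
loss = t2v_turbo_loss(pred, target, reward_fn=lambda x: -np.mean(x ** 2))
```

In practice the paper mixes feedback from multiple reward models; the sketch collapses that mixture into a single `reward_fn` for brevity.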