InstructVideo: Instructing Video Diffusion Models with Human Feedback
December 19, 2023
Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
cs.AI
Abstract
Diffusion models have emerged as the de facto paradigm for video generation.
However, their reliance on web-scale data of varied quality often yields
results that are visually unappealing and misaligned with the textual prompts.
To tackle this problem, we propose InstructVideo to instruct text-to-video
diffusion models with human feedback by reward fine-tuning. InstructVideo has
two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by
generating through the full DDIM sampling chain, we recast reward fine-tuning
as editing. By leveraging the diffusion process to corrupt a sampled video,
InstructVideo requires only partial inference of the DDIM sampling chain,
reducing fine-tuning cost while improving fine-tuning efficiency. 2) To
mitigate the absence of a dedicated video reward model for human preferences,
we repurpose established image reward models, e.g., HPSv2. To this end, we
propose Segmental Video Reward, a mechanism to provide reward signals based on
segmental sparse sampling, and Temporally Attenuated Reward, a method that
mitigates temporal modeling degradation during fine-tuning. Extensive
experiments, both qualitative and quantitative, validate the practicality and
efficacy of using image reward models in InstructVideo, significantly enhancing
the visual quality of generated videos without compromising generalization
capabilities. Code and models will be made publicly available.
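The two reward ingredients described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the midpoint-per-segment sampling rule, and the exponential decay schedule are all illustrative assumptions; the paper only specifies that frames are sampled sparsely per segment, scored with an image reward model such as HPSv2, and weighted with temporally attenuating rewards.

```python
import numpy as np

def temporally_attenuated_weights(num_frames, decay=0.5):
    # Hypothetical attenuation schedule: later frames receive
    # exponentially smaller reward weight, limiting degradation of
    # temporal modeling during fine-tuning. Normalized to sum to 1.
    w = decay ** np.arange(num_frames)
    return w / w.sum()

def segmental_sparse_sample(num_frames, num_segments):
    # Split the frame indices into contiguous segments and pick one
    # representative frame per segment (here: the segment midpoint).
    bounds = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return [(lo + hi) // 2 for lo, hi in zip(bounds[:-1], bounds[1:])]

def segmental_video_reward(frames, image_reward_fn, num_segments=4, decay=0.5):
    # Score only the sparsely sampled frames with an *image* reward
    # model (standing in for e.g. HPSv2), then aggregate the per-frame
    # scores with the attenuated weights to form the video-level reward.
    idx = segmental_sparse_sample(len(frames), num_segments)
    scores = np.array([image_reward_fn(frames[i]) for i in idx])
    weights = temporally_attenuated_weights(len(idx), decay)
    return float(np.dot(weights, scores))
```

In an actual fine-tuning loop, this scalar reward would back-propagate through the partially denoised video produced by the truncated DDIM chain, rather than through a full generation from pure noise.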