DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

June 4, 2025
Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
cs.AI

Abstract

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
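
The key quantitative change relative to vanilla DPO is that the preference objective is applied per temporally aligned segment rather than once per clip. The minimal sketch below illustrates how such a segment-level loss could be aggregated; the function name `segment_dpo_loss`, the per-segment log-likelihood inputs, and the +1/-1/0 preference encoding are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(policy_logp_a, policy_logp_b,
                     ref_logp_a, ref_logp_b,
                     segment_prefs, beta=0.1):
    """Aggregate a DPO-style objective over temporally aligned segments.

    Each *_logp_* tensor has shape (num_segments,) and holds the summed
    log-likelihood that the trainable policy / frozen reference model
    assigns to the corresponding segment of video A or video B.
    `segment_prefs` is +1 where A's segment is preferred, -1 where B's is
    preferred, and 0 for ties, which are skipped.
    """
    # Per-segment log-likelihood ratios of policy vs. frozen reference.
    delta_a = policy_logp_a - ref_logp_a
    delta_b = policy_logp_b - ref_logp_b

    # Orient each segment so the preferred side carries the positive margin.
    margin = segment_prefs * (delta_a - delta_b)

    mask = segment_prefs != 0                    # drop tied segments
    losses = -F.logsigmoid(beta * margin[mask])  # DPO objective, per segment
    return losses.mean()


# Toy usage: 4 segments with random stand-ins for model log-likelihoods.
torch.manual_seed(0)
rand_logp = lambda: torch.randn(4)
prefs = torch.tensor([1.0, -1.0, 1.0, 0.0])
print(segment_dpo_loss(rand_logp(), rand_logp(), rand_logp(), rand_logp(), prefs))
```

In a diffusion setting, the per-segment log-likelihood terms would typically be approximated through the denoising objective, as in Diffusion-DPO; that approximation is omitted from this sketch.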