DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
June 4, 2025
Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
cs.AI
Abstract
Direct Preference Optimization (DPO) has recently been applied as a
post-training technique for text-to-video diffusion models. To obtain training
data, annotators are asked to provide preferences between two videos generated
from independent noise. However, this approach prohibits fine-grained
comparisons, and we point out that it biases the annotators towards low-motion
clips as they often contain fewer visual artifacts. In this work, we introduce
DenseDPO, a method that addresses these shortcomings by making three
contributions. First, we create each video pair for DPO by denoising corrupted
copies of a ground truth video. This results in aligned pairs with similar
motion structures while differing in local details, effectively neutralizing
the motion bias. Second, we leverage the resulting temporal alignment to label
preferences on short segments rather than entire clips, yielding a denser and
more precise learning signal. With only one-third of the labeled data, DenseDPO
greatly improves motion generation over vanilla DPO, while matching it in text
alignment, visual quality, and temporal consistency. Finally, we show that
DenseDPO unlocks automatic preference annotation using off-the-shelf Vision
Language Models (VLMs): GPT accurately predicts segment-level preferences
similar to task-specifically fine-tuned video reward models, and DenseDPO
trained on these labels achieves performance close to using human labels.
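The two mechanisms described in the abstract, building temporally aligned video pairs from a ground-truth clip and scoring preferences per segment, can be illustrated with short sketches. The snippet below is a minimal, hypothetical illustration of the pair-construction idea: a ground-truth clip is partially corrupted with noise and then denoised twice with different seeds, so the two results share the original motion structure while differing in local details. The linear corruption rule, the `denoise_fn` interface, and the `t_corrupt` value are assumptions for illustration, not the authors' released API.

```python
import torch


def make_aligned_pair(gt_video: torch.Tensor,  # [T, C, H, W] ground-truth clip
                      denoise_fn,              # assumed callable: (noisy_video, t, generator) -> video
                      t_corrupt: float = 0.6): # fraction of the noise schedule to re-inject (assumed)
    """Return two videos sharing the GT motion structure but differing in local details."""
    videos = []
    for seed in (0, 1):
        g = torch.Generator().manual_seed(seed)
        noise = torch.randn(gt_video.shape, generator=g)
        # Corrupt the ground-truth clip part-way along the diffusion schedule ...
        noisy = (1.0 - t_corrupt) * gt_video + t_corrupt * noise
        # ... then denoise from that intermediate step; different seeds yield different local details.
        videos.append(denoise_fn(noisy, t_corrupt, g))
    return videos[0], videos[1]
```

Because the two videos are frame-aligned, preferences can be collected on short segments and plugged into a Diffusion-DPO-style objective. The sketch below assumes per-segment denoising errors have already been computed for the policy and a frozen reference model; the function name, tensor layout, and `beta` value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def segment_dpo_loss(err_theta_a, err_ref_a,   # [B, S] per-segment denoising errors on video A
                     err_theta_b, err_ref_b,   # [B, S] per-segment denoising errors on video B
                     pref,                     # [B, S] +1 if A's segment is preferred, -1 if B's, 0 if tie
                     beta: float = 500.0):     # KL-regularization strength (assumed value)
    """DPO logistic loss applied per temporal segment instead of per whole clip."""
    # Implicit per-segment reward: how much the policy lowers the denoising error vs. the reference.
    adv_a = err_ref_a - err_theta_a
    adv_b = err_ref_b - err_theta_b
    # The signed margin is positive when the policy favors the preferred segment.
    margin = pref * (adv_a - adv_b)
    # Only segments that actually received a preference label contribute to the loss.
    mask = (pref != 0).float()
    loss = -F.logsigmoid(beta * margin) * mask
    return loss.sum() / mask.sum().clamp(min=1.0)
```

Segments with `pref = 0` (ties or unlabeled spans) are simply masked out, which is what makes the segment-level labels a denser learning signal than a single clip-level preference.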