DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models
June 4, 2025
Authors: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
cs.AI
Abstract
Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
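To make the first contribution more concrete, the pair-construction step could look roughly like the sketch below: a ground-truth clip is partially noised and then denoised twice with different seeds, so both outputs inherit the same motion structure while local details diverge. The `add_noise` / `denoise_from` interface, the noise strength, and the step count are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of DenseDPO-style pair construction: partially noise a
# ground-truth video and denoise it with the current model so the two samples
# share motion structure but differ in local details. The model interface
# (`add_noise`, `denoise_from`) and all hyperparameters are assumptions.
import torch


def make_aligned_pair(model, gt_video, prompt, noise_strength=0.6, num_steps=50):
    """Return two temporally aligned videos derived from the same ground-truth clip."""
    start_step = int(noise_strength * num_steps)
    pair = []
    for seed in (0, 1):
        g = torch.Generator().manual_seed(seed)
        noise = torch.randn(gt_video.shape, generator=g)
        # Corrupt the ground-truth clip up to an intermediate diffusion step...
        x_t = model.add_noise(gt_video, noise, step=start_step)
        # ...then denoise from that step, so the sample stays close to the
        # original motion while local appearance details can vary.
        pair.append(model.denoise_from(x_t, prompt, start_step=start_step))
    return pair[0], pair[1]
```

Because both samples start from the same corrupted clip, they remain frame-aligned, which is what allows annotators to compare them segment by segment rather than judging entire clips.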
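Given per-segment labels on such aligned pairs, the training objective can be read as a segment-masked variant of the Diffusion-DPO loss. The sketch below is an assumption-laden illustration, not the authors' implementation: tensor shapes, the {+1, -1, 0} label convention, and the `beta` value are hypothetical, and the usual timestep weighting is omitted for brevity.

```python
# Segment-level DPO loss sketch: apply a Bradley-Terry objective to every labeled
# temporal segment instead of once per clip. Shapes and conventions are assumptions.
import torch
import torch.nn.functional as F


def segment_dpo_loss(
    eps_theta_a, eps_ref_a, eps_true_a,  # policy / reference / true noise for video A, (B, T, C, H, W)
    eps_theta_b, eps_ref_b, eps_true_b,  # same for video B
    segment_ids,                         # (T,) long tensor mapping each frame to a segment index
    segment_pref,                        # (B, S): +1 if A is preferred in segment s, -1 if B, 0 if tied/unlabeled
    beta=500.0,
):
    def frame_err(pred, target):
        # Per-frame squared denoising error, averaged over channel and spatial dims -> (B, T)
        return ((pred - target) ** 2).mean(dim=(2, 3, 4))

    # Policy-vs-reference error gaps, as in Diffusion-DPO.
    gap_a = frame_err(eps_theta_a, eps_true_a) - frame_err(eps_ref_a, eps_true_a)  # (B, T)
    gap_b = frame_err(eps_theta_b, eps_true_b) - frame_err(eps_ref_b, eps_true_b)  # (B, T)

    # Pool per-frame gaps into per-segment gaps.
    num_segments = segment_pref.shape[1]
    one_hot = F.one_hot(segment_ids, num_segments).float()  # (T, S)
    counts = one_hot.sum(dim=0).clamp(min=1.0)              # (S,)
    seg_gap_a = gap_a @ one_hot / counts                    # (B, S)
    seg_gap_b = gap_b @ one_hot / counts                    # (B, S)

    # Preference-signed logits: push the policy to denoise the preferred segment better.
    pref = segment_pref.float()
    logits = -beta * pref * (seg_gap_a - seg_gap_b)         # (B, S)
    per_segment = -F.logsigmoid(logits)

    # Average only over segments that actually received a preference label.
    mask = (pref != 0).float()
    return (per_segment * mask).sum() / mask.sum().clamp(min=1.0)
```

Under this reading, a single aligned pair can contribute several preference terms (one per labeled segment), which is one way to interpret the denser and more precise learning signal described in the abstract.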