

DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

June 4, 2025
作者: Ziyi Wu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Ashkan Mirzaei, Igor Gilitschenski, Sergey Tulyakov, Aliaksandr Siarohin
cs.AI

Abstract

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to provide preferences between two videos generated from independent noise. However, this approach prohibits fine-grained comparisons, and we point out that it biases the annotators towards low-motion clips as they often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings by making three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground truth video. This results in aligned pairs with similar motion structures while differing in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO, while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation using off-the-shelf Vision Language Models (VLMs): GPT accurately predicts segment-level preferences similar to task-specifically fine-tuned video reward models, and DenseDPO trained on these labels achieves performance close to using human labels.
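The abstract's central training change is easiest to see as a loss: an aligned video pair (both denoised from corrupted copies of the same ground-truth clip) carries a preference label per short temporal segment rather than one label per clip, and a Diffusion-DPO-style objective is applied segment by segment. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the tensor shapes, segment length, `beta` value, and tie handling are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a segment-level Diffusion-DPO loss.
# Assumes an aligned pair of videos A and B, a per-segment preference
# label, and a standard epsilon-prediction diffusion model with a frozen
# reference copy. All shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def segment_dpo_loss(
    eps_pred_a, eps_pred_b,   # trainable model's noise predictions for A and B: (B, T, C, H, W)
    eps_ref_a, eps_ref_b,     # frozen reference model's predictions, same shapes
    eps_true_a, eps_true_b,   # noise actually added in the forward diffusion process
    seg_pref,                 # (B, S) in {+1, -1, 0}: A preferred, B preferred, or tie per segment
    seg_len=8,                # frames per segment (assumed)
    beta=500.0,               # DPO temperature (assumed)
):
    def seg_mse(pred, target):
        # Per-segment denoising error: MSE over pixels, averaged within each segment.
        err = (pred - target).pow(2).mean(dim=(2, 3, 4))        # (B, T)
        B, T = err.shape
        return err.view(B, T // seg_len, seg_len).mean(dim=-1)  # (B, S)

    # Implicit-reward term per segment: model error relative to the reference model.
    adv_a = seg_mse(eps_pred_a, eps_true_a) - seg_mse(eps_ref_a, eps_true_a)
    adv_b = seg_mse(eps_pred_b, eps_true_b) - seg_mse(eps_ref_b, eps_true_b)

    # seg_pref flips the sign so that lowering the preferred segment's error
    # (relative to the reference) increases the margin.
    margin = seg_pref * (adv_b - adv_a)

    # Logistic DPO loss per segment; tied segments (pref == 0) are masked out,
    # so only confidently labeled segments contribute to the gradient.
    mask = (seg_pref != 0).float()
    loss = -F.logsigmoid(beta * margin) * mask
    return loss.sum() / mask.sum().clamp(min=1)
```

Because both videos in a pair are denoised from the same partially noised ground-truth clip, their frames stay temporally aligned, which is what makes a per-segment comparison (and the masking of uninformative segments) meaningful in the first place.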