DenseDPO: Fine-Grained Temporal Preference Optimization for Video Diffusion Models

Abstract

Direct Preference Optimization (DPO) has recently been applied as a post-training technique for text-to-video diffusion models. To obtain training data, annotators are asked to state a preference between two videos generated from independent noise. However, this setup prevents fine-grained comparisons, and we point out that it biases annotators toward low-motion clips, which often contain fewer visual artifacts. In this work, we introduce DenseDPO, a method that addresses these shortcomings through three contributions. First, we create each video pair for DPO by denoising corrupted copies of a ground-truth video. This yields aligned pairs that share a similar motion structure but differ in local details, effectively neutralizing the motion bias. Second, we leverage the resulting temporal alignment to label preferences on short segments rather than entire clips, yielding a denser and more precise learning signal. With only one-third of the labeled data, DenseDPO greatly improves motion generation over vanilla DPO while matching it in text alignment, visual quality, and temporal consistency. Finally, we show that DenseDPO unlocks automatic preference annotation with off-the-shelf Vision Language Models (VLMs): GPT predicts segment-level preferences about as accurately as video reward models fine-tuned specifically for the task, and DenseDPO trained on these labels achieves performance close to that obtained with human labels.
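
To make the idea of segment-level preferences concrete, the following is a minimal sketch of how a DPO-style loss could be averaged over labeled segments of an aligned video pair. It is an illustration under stated assumptions, not the formulation used in the paper: the function name `segment_dpo_loss`, the per-segment denoising errors, the label encoding (+1, -1, 0), and the `beta` scale are all hypothetical choices made for this example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def segment_dpo_loss(err_a, err_b, ref_err_a, ref_err_b, labels, beta=1.0):
    """Average a DPO-style loss over the labeled segments of one video pair.

    All arguments are hypothetical inputs chosen for this sketch:
      err_a, err_b         per-segment denoising errors of the trained model
                           on videos A and B
      ref_err_a, ref_err_b the same errors under a frozen reference model
      labels               +1 if A's segment is preferred, -1 if B's,
                           0 for a tie (tied segments are skipped)
      beta                 scale of the preference margin
    """
    n = len(labels)
    if not all(len(v) == n for v in (err_a, err_b, ref_err_a, ref_err_b)):
        raise ValueError("all inputs must have one entry per segment")

    losses = []
    for ea, eb, ra, rb, y in zip(err_a, err_b, ref_err_a, ref_err_b, labels):
        if y == 0:
            continue  # no preference on this segment, so no training signal
        # Positive margin: relative to the reference model, the trained model
        # fits A's segment better than B's segment.
        margin = (eb - rb) - (ea - ra)
        losses.append(-log_sigmoid(beta * y * margin))

    return sum(losses) / len(losses) if losses else 0.0


# Toy pair with two segments: the model fits A better in segment 0
# and B better in segment 1.
errors = dict(
    err_a=[0.1, 0.5],
    err_b=[0.5, 0.1],
    ref_err_a=[0.3, 0.3],
    ref_err_b=[0.3, 0.3],
)
dense = segment_dpo_loss(labels=[1, -1], **errors)  # per-segment labels
clip = segment_dpo_loss(labels=[1, 1], **errors)    # one label for the whole clip
```

In this toy input, the per-segment labels agree with the model in both segments, so `dense` comes out lower than `clip`, where a single clip-level label is forced onto a segment in which the other video is actually better. This mirrors, in a simplified way, why segment-level labels can give a more precise learning signal than one label per clip.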
