

Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

November 3, 2025
作者: Jie Du, Xinyu Gong, Qingshan Tan, Wen Li, Yangming Cheng, Weitao Wang, Chenlu Zhan, Suhui Wu, Hao Zhang, Jun Zhang
cs.AI

Abstract

Recent studies have identified Direct Preference Optimization (DPO) as an efficient, reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks: costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce GT-Pair, which automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the supervised fine-tuning (SFT) loss as a regularization term into the DPO objective to enhance training stability and generation fidelity. Additionally, by combining the Fully Sharded Data Parallel (FSDP) framework with multiple memory optimization techniques, our approach achieves nearly three times the training capacity of FSDP alone. Extensive experiments on both image-to-video (I2V) and text-to-video (T2V) tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
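
The abstract states that Reg-DPO adds the SFT loss to the DPO objective as a regularizer but gives no formula. Below is a minimal PyTorch-style sketch of one plausible form of such a combined loss on a GT-Pair (real video as positive, model-generated video as negative); the function name `reg_dpo_loss`, the hyperparameters `beta` and `lam`, and the exact weighting are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def reg_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
                 beta=0.1, lam=1.0):
    """DPO loss with an SFT regularization term (illustrative sketch).

    logp_pos / logp_neg:         summed log-probabilities of the positive (real)
                                 and negative (model-generated) video under the
                                 policy being trained.
    ref_logp_pos / ref_logp_neg: the same quantities under the frozen
                                 reference model.
    beta, lam:                   assumed hyperparameters (DPO temperature and
                                 SFT-regularization weight), not taken from the paper.
    """
    # Standard DPO preference term on the GT-Pair:
    # push the policy to prefer the real video over its own generation.
    pos_logratio = logp_pos - ref_logp_pos
    neg_logratio = logp_neg - ref_logp_neg
    dpo_term = -F.logsigmoid(beta * (pos_logratio - neg_logratio))

    # SFT regularizer: negative log-likelihood of the positive (ground-truth)
    # sample, intended to stabilize training and preserve generation fidelity.
    sft_term = -logp_pos

    return (dpo_term + lam * sft_term).mean()

# Toy usage with a batch of two preference pairs.
loss = reg_dpo_loss(
    logp_pos=torch.tensor([-120.0, -110.0]),
    logp_neg=torch.tensor([-150.0, -140.0]),
    ref_logp_pos=torch.tensor([-125.0, -112.0]),
    ref_logp_neg=torch.tensor([-148.0, -139.0]),
)
print(loss.item())
```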