Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Abstract

Recent studies have identified Direct Preference Optimization (DPO) as an efficient, reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are developed mainly on small-scale models (approximately 2B parameters), which limits their ability to address challenges specific to video tasks: costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce GT-Pair, which automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the supervised fine-tuning (SFT) loss as a regularization term in the DPO objective to improve both training stability and generation fidelity. Additionally, by combining the Fully Sharded Data Parallel (FSDP) framework with multiple memory optimization techniques, our approach achieves nearly three times the training capacity of FSDP alone. Extensive experiments on image-to-video (I2V) and text-to-video (T2V) tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches and delivers superior video generation quality.
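To make the idea of an SFT-regularized DPO objective concrete, the following is a minimal sketch, not the exact formulation used in this work. It assumes the standard DPO loss over per-sample log-likelihoods under the trained policy and a frozen reference model, and adds the negative log-likelihood of the preferred (real) sample as the SFT term. The function names, the additive combination, and the `beta` and `sft_weight` parameters are illustrative assumptions; for diffusion-based video models the log-likelihood terms would typically be replaced by a denoising-error surrogate.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_win: float, policy_logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    In a GT-Pair-style setup, the 'win' sample would be the real video
    and the 'lose' sample the model-generated one.
    """
    margin = beta * ((policy_logp_win - ref_logp_win)
                     - (policy_logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


def reg_dpo_loss(policy_logp_win: float, policy_logp_lose: float,
                 ref_logp_win: float, ref_logp_lose: float,
                 beta: float = 0.1, sft_weight: float = 1.0) -> float:
    """DPO loss plus an SFT regularizer on the preferred sample.

    Assumption: the SFT term is the negative log-likelihood of the
    preferred sample, added with a hypothetical weight `sft_weight`.
    """
    dpo = dpo_loss(policy_logp_win, policy_logp_lose,
                   ref_logp_win, ref_logp_lose, beta)
    sft = -policy_logp_win
    return dpo + sft_weight * sft
```

With `sft_weight=0.0` the sketch reduces to plain DPO. A positive weight keeps the likelihood of the real sample from drifting downward while the preference margin grows, which is one plausible way such a regularizer could stabilize training.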
