Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

Abstract

Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of the training data presents notable challenges, yet it remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, which suggests potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns the hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when the models are fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
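To make the cross-frame idea concrete, the following is a minimal NumPy sketch of a regularizer that aligns each frame's hidden states with external features from the same frame and from neighboring frames. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed `neighbor_weight`, the choice of `offsets`, and the use of cosine similarity are all hypothetical, and the hidden states are assumed to be already projected to the same shape as the external features, `(frames, tokens, dim)`.

```python
import numpy as np


def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity along the last axis (eps avoids division by zero)."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_n * b_n, axis=-1)


def cross_frame_alignment_loss(hidden, external, offsets=(-1, 1),
                               neighbor_weight=0.5):
    """Hypothetical cross-frame alignment regularizer.

    hidden, external: arrays of shape (frames, tokens, dim).
    offsets: relative frame indices treated as neighbors (assumption).
    neighbor_weight: fixed weight on neighbor terms (assumption).
    Returns a scalar; lower means better alignment.
    """
    num_frames = hidden.shape[0]
    total = 0.0
    for f in range(num_frames):
        # Same-frame term: the per-frame alignment a direct REPA
        # adaptation would use.
        score = cosine_sim(hidden[f], external[f]).mean()
        # Cross-frame terms: also pull frame f toward the external
        # features of its neighbors, skipping out-of-range frames.
        for k in offsets:
            g = f + k
            if 0 <= g < num_frames:
                score += neighbor_weight * cosine_sim(hidden[f], external[g]).mean()
        total += score
    # Negate so that maximizing similarity minimizes the loss.
    return -total / num_frames
```

Passing `offsets=()` reduces the sketch to the per-frame alignment alone, which makes the contribution of the neighbor terms easy to isolate. In a fine-tuning setup, a term of this kind would be added to the diffusion training objective with a scaling coefficient.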