Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
June 10, 2025
Authors: Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo
cs.AI
Abstract
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate
videos that reflect specific attributes of training data presents notable
challenges, yet remains underexplored despite its practical importance.
Meanwhile, recent work such as Representation Alignment (REPA) has shown
promise in improving the convergence and quality of DiT-based image diffusion
models by aligning, or assimilating, their internal hidden states with
external pretrained visual features, suggesting its potential for VDM fine-tuning. In
this work, we first propose a straightforward adaptation of REPA for VDMs and
empirically show that, while effective for convergence, it is suboptimal in
preserving semantic consistency across frames. To address this limitation, we
introduce Cross-frame Representation Alignment (CREPA), a novel regularization
technique that aligns hidden states of a frame with external features from
neighboring frames. Empirical evaluations on large-scale VDMs, including
CogVideoX-5B and HunyuanVideo, demonstrate that CREPA improves both visual
fidelity and cross-frame semantic coherence when fine-tuned with
parameter-efficient methods such as LoRA. We further validate CREPA across
diverse datasets with varying attributes, confirming its broad applicability.
Project page: https://crepavideo.github.io
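
To make the core idea concrete, below is a minimal PyTorch-style sketch of a cross-frame alignment regularizer in the spirit of CREPA. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `crepa_loss`, the tensor shapes, the small projection head `proj`, and the neighbor `window` are all illustrative, and the external per-frame features are assumed to come from a frozen pretrained visual encoder (e.g., something like DINOv2).

```python
import torch
import torch.nn.functional as F

def crepa_loss(hidden, external, proj, window=1):
    """Cross-frame alignment sketch (illustrative; not the paper's exact loss).

    hidden:   (F, N, D_h) per-frame DiT hidden states (N tokens per frame)
    external: (F, N, D_e) frozen features from a pretrained visual encoder
    proj:     small trainable head mapping D_h -> D_e
    window:   number of neighboring frames each frame is aligned to
    """
    z = proj(hidden)  # project hidden states into the external feature space
    num_frames = hidden.shape[0]
    loss, count = 0.0, 0
    for f in range(num_frames):
        for off in range(-window, window + 1):
            g = f + off
            if 0 <= g < num_frames:
                # Negative cosine similarity between frame f's projected
                # hidden states and the external features of a nearby frame g.
                sim = F.cosine_similarity(z[f], external[g].detach(), dim=-1)
                loss = loss - sim.mean()
                count += 1
    return loss / count
```

In a fine-tuning setup such as LoRA, a term like this would typically be added to the standard diffusion objective with a small weighting coefficient. Note that with `window=0` the sketch degenerates to aligning each frame only with its own external features, i.e., the straightforward per-frame REPA adaptation that the abstract describes as suboptimal for cross-frame semantic consistency.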