Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models
June 10, 2025
Authors: Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo
cs.AI
Abstract
Fine-tuning Video Diffusion Models (VDMs) at the user level to generate
videos that reflect specific attributes of training data presents notable
challenges, yet remains underexplored despite its practical importance.
Meanwhile, recent work such as Representation Alignment (REPA) has shown
promise in improving the convergence and quality of DiT-based image diffusion
models by aligning, or assimilating, their internal hidden states with
external pretrained visual features, suggesting its potential for VDM fine-tuning. In
this work, we first propose a straightforward adaptation of REPA for VDMs and
empirically show that, while effective for convergence, it is suboptimal in
preserving semantic consistency across frames. To address this limitation, we
introduce Cross-frame Representation Alignment (CREPA), a novel regularization
technique that aligns hidden states of a frame with external features from
neighboring frames. Empirical evaluations on large-scale VDMs, including
CogVideoX-5B and HunyuanVideo, demonstrate that CREPA improves both visual
fidelity and cross-frame semantic coherence when fine-tuned with
parameter-efficient methods such as LoRA. We further validate CREPA across
diverse datasets with varying attributes, confirming its broad applicability.
Project page: https://crepavideo.github.io
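
To make the core idea concrete, below is a minimal PyTorch-style sketch of a cross-frame alignment regularizer in the spirit of CREPA. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `crepa_loss`, the tensor shapes, the small projection head `proj`, and the neighbor `window` are all illustrative, and the external per-frame features are assumed to come from a frozen pretrained visual encoder (e.g., something like DINOv2).

```python
import torch
import torch.nn.functional as F

def crepa_loss(hidden, external, proj, window=1):
    """Cross-frame alignment sketch (illustrative; not the paper's exact loss).

    hidden:   (F, N, D_h) per-frame DiT hidden states (N tokens per frame)
    external: (F, N, D_e) frozen features from a pretrained visual encoder
    proj:     small trainable head mapping D_h -> D_e
    window:   number of neighboring frames each frame is aligned to
    """
    z = proj(hidden)  # project hidden states into the external feature space
    num_frames = hidden.shape[0]
    loss, count = 0.0, 0
    for f in range(num_frames):
        for off in range(-window, window + 1):
            g = f + off
            if 0 <= g < num_frames:
                # Negative cosine similarity between frame f's projected
                # hidden states and the external features of a nearby frame g.
                sim = F.cosine_similarity(z[f], external[g].detach(), dim=-1)
                loss = loss - sim.mean()
                count += 1
    return loss / count
```

In a fine-tuning setup such as LoRA, a term like this would typically be added to the standard diffusion objective with a small weighting coefficient. Note that with `window=0` the sketch degenerates to aligning each frame only with its own external features, i.e., the straightforward per-frame REPA adaptation that the abstract describes as suboptimal for cross-frame semantic consistency.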