

Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

June 10, 2025
Authors: Sungwon Hwang, Hyojin Jang, Kinam Kim, Minho Park, Jaegul Choo
cs.AI

Abstract

Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, suggesting its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
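The abstract describes CREPA only at a high level: each frame's hidden states are aligned not just with that frame's external pretrained features (as in REPA), but also with features from neighboring frames. The exact loss is not given here, so the following is a hypothetical minimal sketch, assuming a REPA-style patch-wise cosine-similarity objective, already-projected hidden states, frozen per-frame external features (e.g. from a pretrained vision encoder), and a symmetric neighbor window `offsets`; all names and shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Mean patch-wise cosine similarity between two (patches, dim) arrays.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float((a * b).sum(axis=-1).mean())

def crepa_loss(hidden, external, offsets=(-1, 0, 1)):
    """Sketch of a cross-frame alignment regularizer (hypothetical).

    hidden:   (frames, patches, dim) projected diffusion-model hidden states
    external: (frames, patches, dim) frozen pretrained features per frame
    offsets:  temporal offsets; offset 0 recovers a REPA-style same-frame
              alignment, nonzero offsets add the cross-frame terms.
    Returns the negative mean similarity, so minimizing it maximizes
    alignment of each frame with its neighbors' external features.
    """
    num_frames = hidden.shape[0]
    total, count = 0.0, 0
    for t in range(num_frames):
        for k in offsets:
            s = t + k
            if 0 <= s < num_frames:  # clamp at video boundaries
                total += cosine_sim(hidden[t], external[s])
                count += 1
    return -total / count
```

In practice such a term would be added, with a small weight, to the diffusion training loss during LoRA fine-tuning; with `offsets=(0,)` the sketch degenerates to per-frame alignment, which is the variant the paper reports as suboptimal for cross-frame semantic consistency.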