Cross-Frame Representation Alignment for Fine-Tuning Video Diffusion Models

Abstract

Fine-tuning Video Diffusion Models (VDMs) at the user level to generate videos that reflect specific attributes of the training data presents notable challenges, yet remains underexplored despite its practical importance. Meanwhile, recent work such as Representation Alignment (REPA) has shown promise in improving the convergence and quality of DiT-based image diffusion models by aligning, or assimilating, their internal hidden states with external pretrained visual features, which suggests its potential for VDM fine-tuning. In this work, we first propose a straightforward adaptation of REPA for VDMs and empirically show that, while effective for convergence, it is suboptimal in preserving semantic consistency across frames. To address this limitation, we introduce Cross-frame Representation Alignment (CREPA), a novel regularization technique that aligns the hidden states of a frame with external features from neighboring frames. Empirical evaluations on large-scale VDMs, including CogVideoX-5B and Hunyuan Video, demonstrate that CREPA improves both visual fidelity and cross-frame semantic coherence when these models are fine-tuned with parameter-efficient methods such as LoRA. We further validate CREPA across diverse datasets with varying attributes, confirming its broad applicability. Project page: https://crepavideo.github.io
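To make the difference between per-frame and cross-frame alignment concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract states only that CREPA aligns the hidden states of a frame with external features from neighboring frames. Everything else here is an assumption made for illustration: the use of cosine similarity, the neighbor offsets, the `neighbor_weight` value, the function names, and the premise that hidden states have already been projected to the feature dimension.

```python
import numpy as np


def cosine_sim(a, b, eps=1e-8):
    """Cosine similarity along the last axis."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a_n * b_n).sum(axis=-1)


def repa_loss(hidden, feats):
    """Per-frame alignment: frame f's hidden states vs. frame f's features.

    hidden, feats: arrays of shape (frames, tokens, dim). `hidden` is assumed
    to be already projected to the external feature dimension.
    """
    return -cosine_sim(hidden, feats).mean()


def crepa_loss(hidden, feats, offsets=(-1, 1), neighbor_weight=0.5):
    """Cross-frame alignment (hypothetical form): frame f is also aligned
    with the external features of frames f + k for each offset k."""
    num_frames = hidden.shape[0]
    # Same-frame similarity, averaged over tokens: shape (frames,).
    total = cosine_sim(hidden, feats).mean(axis=-1)
    for k in offsets:
        # Indices of frames whose neighbor f + k lies inside the clip.
        src = np.arange(max(0, -k), min(num_frames, num_frames - k))
        neighbor_sim = cosine_sim(hidden[src], feats[src + k]).mean(axis=-1)
        total[src] += neighbor_weight * neighbor_sim
    # Negate so that higher similarity means lower loss.
    return -total.mean()
```

In a fine-tuning loop, a term of this kind would be added to the diffusion objective with a small coefficient. Frames at the clip boundary simply receive fewer neighbor terms in this sketch; other boundary treatments are equally plausible.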
