AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Abstract

Diffusion models have achieved remarkable progress in video generation. However, because they denoise iteratively, they need a large number of inference steps to generate a video, which makes generation slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges in existing diffusion distillation methods and propose AccVideo, a novel and efficient method that reduces the number of inference steps of video diffusion models by using a synthetic dataset. We use the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that uses key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy that aligns the output distribution of the student model with that of the synthetic dataset, thereby enhancing video quality. Extensive experiments demonstrate that our model generates videos 8.5x faster than the teacher model while maintaining comparable performance. Compared with previous acceleration methods, our approach can generate videos of higher quality and resolution, namely 5 seconds at 720x1280 and 24 fps.
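The idea of building a synthetic dataset from teacher denoising trajectories and keeping only a few key data points can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the scalar "video", the straight-line `toy_teacher_velocity` stand-in for a pretrained model, the evenly spaced key points, and all function names are hypothetical.

```python
def toy_teacher_velocity(x, t, target):
    """Hypothetical stand-in for a pretrained teacher: velocity of a
    straight path from noise (t=1) to the target sample (t=0)."""
    return (x - target) / t


def generate_trajectory(noise, num_steps, target=0.5):
    """Run the toy teacher with Euler steps from t=1 down to t=0.

    Returns num_steps + 1 points as (t, x_t) tuples, starting at the noise.
    """
    trajectory = [(1.0, noise)]
    x = noise
    for k in range(num_steps):
        t = 1.0 - k / num_steps  # current time, always > 0 here
        x = x - toy_teacher_velocity(x, t, target) / num_steps
        trajectory.append((1.0 - (k + 1) / num_steps, x))
    return trajectory


def select_key_points(trajectory, num_student_steps):
    """Keep evenly spaced points so the student needs only a few steps.

    Even spacing is an assumption of this sketch.
    """
    teacher_steps = len(trajectory) - 1
    if teacher_steps % num_student_steps != 0:
        raise ValueError("teacher steps must be divisible by student steps")
    stride = teacher_steps // num_student_steps
    return trajectory[::stride]


def build_training_pairs(key_points):
    """Pair each key point (t_i, x_i) with the next key point's x as target."""
    return [
        ((t_a, x_a), x_b)
        for (t_a, x_a), (_, x_b) in zip(key_points, key_points[1:])
    ]


if __name__ == "__main__":
    traj = generate_trajectory(noise=2.0, num_steps=50)
    keys = select_key_points(traj, num_student_steps=5)
    for (t, x), x_next in build_training_pairs(keys):
        print(f"t={t:.1f}  x_t={x:.3f}  ->  target={x_next:.3f}")
```

In this sketch every training input lies on a trajectory the teacher actually produced, which is one way to read the abstract's point about avoiding useless data points. A student trained on such pairs would need only as many steps as there are pairs. The adversarial alignment step described in the abstract is not modeled here.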
