FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling

Abstract

With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has made substantial progress. However, existing video generation models are typically trained on a limited number of frames and therefore cannot generate high-fidelity long videos at inference time. Furthermore, these models support only a single text condition, whereas real-life scenarios often require multiple text conditions because video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building on these observations, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them with a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurred 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
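To make the two ideas in point 1) more concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the noise sequence is rescheduled or how the window-based function is defined, so the sketch assumes that rescheduling means reusing a short base window of per-frame noises in shuffled order, and that windowed attention means restricting temporal attention to sliding index ranges. The names `reschedule_noise` and `attention_windows`, and the frame counts, window size, and stride, are all hypothetical.

```python
import random


def reschedule_noise(base_noise, total_frames, seed=0):
    """Extend a short list of per-frame noises to total_frames frames.

    Assumed scheme (not confirmed by the abstract): keep the base window
    as-is, then append shuffled copies of it, so distant frames share the
    same underlying noise values.
    """
    rng = random.Random(seed)
    frames = list(base_noise)
    while len(frames) < total_frames:
        chunk = list(base_noise)
        rng.shuffle(chunk)  # same noises, new temporal order
        frames.extend(chunk)
    return frames[:total_frames]


def attention_windows(total_frames, window, stride):
    """Return (start, end) index ranges of sliding windows covering all frames."""
    if total_frames <= window:
        return [(0, total_frames)]
    starts = list(range(0, total_frames - window + 1, stride))
    if starts[-1] + window < total_frames:
        starts.append(total_frames - window)  # make sure the tail is covered
    return [(s, s + window) for s in starts]


# Each "noise" is a small Gaussian vector standing in for one latent frame.
rng = random.Random(42)
base = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]

long_noise = reschedule_noise(base, total_frames=64)
windows = attention_windows(64, window=16, stride=4)
```

In this sketch, temporal attention would be computed separately inside each range in `windows`, with the window length equal to the assumed training length of 16 frames. How the outputs of overlapping windows are combined is left open here, since the abstract does not describe it.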

