ControlVideo: Training-free Controllable Text-to-Video Generation

Abstract

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design ControlVideo, a training-free framework that enables natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction to the self-attention modules. Second, to mitigate flicker, it introduces an interleaved-frame smoother that applies frame interpolation to alternating frames. Finally, to produce long videos efficiently, it uses a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Empowered with these modules, ControlVideo outperforms the state of the art on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to its efficient design, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
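To illustrate the first module, the following is a minimal NumPy sketch of what full cross-frame interaction in self-attention could look like: each frame's queries attend to the keys and values of every frame, rather than only its own. It is an illustration under stated assumptions, not the authors' implementation. The function names, the tensor layout `(frames, tokens, dim)`, and the omission of learned projections and multiple heads are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def per_frame_attention(q, k, v):
    """Baseline: each frame attends only to its own tokens.

    q, k, v have the assumed shape (frames, tokens, dim).
    """
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (F, T, T)
    return softmax(scores) @ v                       # (F, T, D)


def full_cross_frame_attention(q, k, v):
    """Hypothetical sketch: every frame attends to the tokens of all frames."""
    f, t, d = q.shape
    # Pool keys and values from all frames into one shared sequence.
    k_all = k.reshape(1, f * t, d)                       # (1, F*T, D)
    v_all = v.reshape(1, f * t, d)                       # (1, F*T, D)
    # Broadcasting over the frame axis lets each frame query the shared pool.
    scores = q @ k_all.transpose(0, 2, 1) / np.sqrt(d)   # (F, T, F*T)
    return softmax(scores) @ v_all                       # (F, T, D)


rng = np.random.default_rng(0)
frames, tokens, dim = 4, 6, 8  # toy sizes chosen only for the demo
q = rng.standard_normal((frames, tokens, dim))
k = rng.standard_normal((frames, tokens, dim))
v = rng.standard_normal((frames, tokens, dim))
out = full_cross_frame_attention(q, k, v)
print(out.shape)
```

Because all frames draw from the same pool of keys and values, their outputs are mixed from shared appearance information, which is the intuition behind using this kind of attention for appearance coherence. The cost is that the score matrix grows with the square of the frame count, so a real implementation would need to budget memory accordingly.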

