ControlVideo: Training-free Controllable Text-to-Video Generation

Abstract

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Besides the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a training-free framework called ControlVideo to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction to the self-attention modules. Second, to mitigate flicker, it introduces an interleaved-frame smoother that applies frame interpolation to alternate frames. Finally, to produce long videos efficiently, it uses a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Empowered with these modules, ControlVideo outperforms the state of the art on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to these efficient designs, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
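To make the idea of interpolating alternate frames concrete, the following is a minimal sketch and not the ControlVideo implementation. It assumes toy frames stored as flat lists of floats and uses a plain element-wise average of the two neighbouring frames as a stand-in for whatever interpolation method the framework actually uses; the abstract does not specify it. The function name `smooth_alternate_frames` and the `parity` parameter are hypothetical and chosen only for illustration.

```python
from typing import List

Frame = List[float]  # toy stand-in for a frame: a flat list of pixel values


def smooth_alternate_frames(frames: List[Frame], parity: int) -> List[Frame]:
    """Replace interior frames whose index has the given parity
    (0 = even, 1 = odd) with the average of their two neighbours.

    Assumption: linear averaging stands in for a real frame
    interpolation method, which the abstract does not name.
    """
    if parity not in (0, 1):
        raise ValueError("parity must be 0 or 1")
    smoothed = [list(f) for f in frames]  # copy so the input is untouched
    # The first and last frames have only one neighbour, so skip them.
    for i in range(1, len(frames) - 1):
        if i % 2 == parity:
            prev_f, next_f = frames[i - 1], frames[i + 1]
            smoothed[i] = [(a + b) / 2.0 for a, b in zip(prev_f, next_f)]
    return smoothed


# A flickering toy sequence: frame 2 jumps away from its neighbours.
video = [[0.0], [1.0], [9.0], [3.0], [4.0]]
even_pass = smooth_alternate_frames(video, parity=0)     # smooths frame 2
odd_pass = smooth_alternate_frames(even_pass, parity=1)  # smooths frames 1 and 3
```

Alternating between even and odd passes means every interior frame is eventually smoothed against its neighbours, while each single pass leaves half of the frames untouched to serve as anchors.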
