
Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models

May 10, 2023
Authors: Rohan Dhesikan, Vignesh Rajmohan
cs.AI

Abstract

The proliferation of video content demands efficient and flexible neural network-based approaches for generating new video content. In this paper, we propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models. Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames, building upon the Text-to-Video Zero architecture and incorporating ControlNet to enable additional input conditions. By first interpolating frames between the input sketches and then running Text-to-Video Zero with the interpolated frame sequence as the control signal, we leverage the benefits of both zero-shot text-to-video generation and the robust control provided by ControlNet. Experiments demonstrate that our method excels at producing high-quality and remarkably consistent video content that more accurately aligns with the user's intended motion for the subject within the video. We provide a comprehensive resource package, including a demo video, project website, open-source GitHub repository, and a Colab playground to foster further research and application of our proposed method.
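
The abstract describes a two-stage pipeline: interpolate between the user's sketched keyframes, then condition video generation on the interpolated sequence. The following is a minimal sketch of that idea, not the authors' implementation: it linearly cross-fades two scribble frames as a stand-in for a learned frame interpolator, and conditions Stable Diffusion with a scribble ControlNet on each interpolated frame while reusing a single fixed latent to crudely approximate the cross-frame consistency that Text-to-Video Zero obtains via cross-frame attention. All checkpoint names, file names, and parameters are illustrative assumptions.

# Minimal sketch (not the authors' code): sketch-frame interpolation followed by
# per-frame ControlNet-conditioned generation, as a rough stand-in for the STF pipeline.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

def interpolate_sketches(first: Image.Image, last: Image.Image, n_frames: int):
    # Linear cross-fade between two sketch keyframes; in practice a learned
    # frame-interpolation model would produce the in-between control frames.
    a = np.asarray(first.convert("L").resize((512, 512)), dtype=np.float32)
    b = np.asarray(last.convert("L").resize((512, 512)), dtype=np.float32)
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        blended = (1.0 - t) * a + t * b
        frames.append(Image.fromarray(blended.astype(np.uint8)).convert("RGB"))
    return frames

# Scribble-conditioned ControlNet on top of Stable Diffusion 1.5 (illustrative checkpoints).
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

sketch_start = Image.open("sketch_start.png")  # hypothetical user-drawn keyframes
sketch_end = Image.open("sketch_end.png")
control_frames = interpolate_sketches(sketch_start, sketch_end, n_frames=8)

# Reusing one initial latent for every frame is a crude proxy for the cross-frame
# attention mechanism that keeps frames consistent in Text-to-Video Zero.
generator = torch.Generator(device="cuda").manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device="cuda", dtype=torch.float16,
)

video_frames = []
for control in control_frames:
    frame = pipe(
        prompt="a corgi running on the beach, photorealistic",
        image=control,
        latents=latents.clone(),
        num_inference_steps=20,
    ).images[0]
    video_frames.append(frame)

# Assemble the generated frames into an animated GIF for quick inspection.
video_frames[0].save(
    "stf_sketch.gif", save_all=True,
    append_images=video_frames[1:], duration=125, loop=0,
)

Reusing the same latent trades motion diversity for stability; the paper's approach instead relies on the Text-to-Video Zero backbone, with ControlNet providing the per-frame structural guidance from the interpolated sketches.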