Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models

May 10, 2023
Authors: Rohan Dhesikan, Vignesh Rajmohan
cs.AI

Abstract

The proliferation of video content demands efficient and flexible neural network-based approaches for generating new video content. In this paper, we propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models. Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames, building upon the Text-to-Video Zero architecture and incorporating ControlNet to enable additional input conditions. By first interpolating frames between the inputted sketches and then running Text-to-Video Zero with the newly interpolated video frames as the control signal, we leverage the benefits of both zero-shot text-to-video generation and the robust control provided by ControlNet. Experiments demonstrate that our method excels at producing high-quality and remarkably consistent video content that more accurately aligns with the user's intended motion for the subject within the video. We provide a comprehensive resource package, including a demo video, project website, open-source GitHub repository, and a Colab playground to foster further research and application of our proposed method.
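
To make the two-stage pipeline described in the abstract concrete, the sketch below shows (1) a naive linear cross-fade between consecutive sketch keyframes to build an interpolated control video, and (2) a placeholder call into a Text2Video-Zero-plus-ControlNet generation step. This is a minimal illustration under stated assumptions: the `interpolate_sketches` crossfade and the `run_text2video_zero_controlnet` interface are hypothetical simplifications introduced here, not the authors' actual implementation; the file paths in the usage example are placeholders as well.

```python
# Minimal sketch of the STF pipeline described in the abstract.
# Stage 1: interpolate frames between user-provided sketch keyframes.
# Stage 2: condition a zero-shot text-to-video model on those frames via ControlNet.
# The stage-2 function below is a hypothetical placeholder, not the published interface.

import numpy as np
from PIL import Image


def interpolate_sketches(sketches, frames_between=8):
    """Linearly cross-fade between consecutive sketch keyframes.

    `sketches` is a list of same-sized PIL images. The paper may use a more
    sophisticated interpolation scheme; this crossfade is only illustrative.
    """
    arrays = [np.asarray(s.convert("L"), dtype=np.float32) for s in sketches]
    frames = []
    for a, b in zip(arrays[:-1], arrays[1:]):
        for t in np.linspace(0.0, 1.0, frames_between, endpoint=False):
            blended = (1.0 - t) * a + t * b
            frames.append(Image.fromarray(blended.astype(np.uint8)))
    frames.append(Image.fromarray(arrays[-1].astype(np.uint8)))
    return frames


def run_text2video_zero_controlnet(prompt, control_frames):
    """Placeholder for Text2Video-Zero generation with ControlNet conditioning.

    A real implementation would load a Stable Diffusion checkpoint with a
    sketch/scribble ControlNet, apply Text2Video-Zero's cross-frame attention,
    and use `control_frames` as the per-frame control signal.
    """
    raise NotImplementedError("Hypothetical interface; see the project's GitHub repository.")


if __name__ == "__main__":
    # Placeholder file names for user-drawn sketch keyframes.
    keyframes = [Image.open(p) for p in ["sketch_0.png", "sketch_1.png", "sketch_2.png"]]
    control_video = interpolate_sketches(keyframes, frames_between=8)
    # video = run_text2video_zero_controlnet("an astronaut riding a horse", control_video)
```

The interpolated control video is what gives the generator frame-by-frame guidance, so the density of interpolated frames (here `frames_between`) trades off smoothness of the intended motion against how strictly each generated frame is pinned to a sketch.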