Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models

Abstract

The proliferation of video content demands efficient and flexible neural-network-based approaches for generating new video content. In this paper, we propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models. Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames. It builds on the Text-to-Video Zero architecture and incorporates ControlNet to enable additional input conditions. We first interpolate frames between the input sketches and then run Text-to-Video Zero with the resulting interpolated video as the control input, which lets us leverage both the benefits of zero-shot text-to-video generation and the robust control provided by ControlNet. Experiments demonstrate that our method produces high-quality, remarkably consistent video content that aligns more accurately with the user's intended motion for the subject within the video. We provide a comprehensive resource package, including a demo video, project website, open-source GitHub repository, and a Colab playground, to foster further research and application of our proposed method.
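The first stage described above, interpolating frames between the input sketches, can be illustrated with a minimal sketch. The abstract does not say which interpolation method is used, so the linear cross-fade below is an assumption chosen for simplicity. A cross-fade only blends pixel intensities and does not estimate motion. The names `blend` and `interpolate_sketches` are hypothetical, and frames are plain nested lists to keep the example dependency-free. The second stage, running Text-to-Video Zero with ControlNet on the interpolated video, is not shown.

```python
from typing import List

# A grayscale frame as rows of pixel intensities in [0, 1].
# Plain lists are an assumption for illustration; arrays would be usual in practice.
Frame = List[List[float]]


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linearly cross-fade two equally sized frames; t=0 gives a, t=1 gives b."""
    return [
        [(1.0 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def interpolate_sketches(keyframes: List[Frame], steps_between: int) -> List[Frame]:
    """Insert `steps_between` blended frames between each pair of sketches."""
    if len(keyframes) < 2:
        return list(keyframes)
    frames: List[Frame] = []
    for start, end in zip(keyframes, keyframes[1:]):
        frames.append(start)
        for i in range(1, steps_between + 1):
            # Evenly spaced blend weights strictly between 0 and 1.
            frames.append(blend(start, end, i / (steps_between + 1)))
    frames.append(keyframes[-1])
    return frames


# Two tiny 1x2 "sketches": a stroke moving from the left pixel to the right.
sketches = [[[1.0, 0.0]], [[0.0, 1.0]]]
control_video = interpolate_sketches(sketches, steps_between=3)
```

In this toy case `control_video` holds the two sketches plus three blended frames. In a full pipeline, a sequence like this would serve as the per-frame control input for the text-to-video stage.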

