Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models

Abstract

The proliferation of video content demands efficient and flexible neural-network-based approaches for generating new video content. In this paper, we propose a novel approach that combines zero-shot text-to-video generation with ControlNet to improve the output of these models. Our method takes multiple sketched frames as input and generates video output that matches the flow of these frames. It builds on the Text-to-Video Zero architecture and incorporates ControlNet to enable additional input conditions. We first interpolate frames between the input sketches and then run Text-to-Video Zero with the resulting interpolated-frame video as the control signal, which lets us leverage both zero-shot text-to-video generation and the robust control provided by ControlNet. Experiments demonstrate that our method produces high-quality and remarkably consistent video content that aligns more accurately with the user's intended motion for the subject within the video. We provide a comprehensive resource package, including a demo video, a project website, an open-source GitHub repository, and a Colab playground, to foster further research and application of our proposed method.
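
The interpolation step described above can be illustrated with a minimal sketch. The abstract does not state which interpolation method is used, so the code below assumes a plain linear cross-fade between consecutive sketches as a stand-in. The names `Frame`, `blend`, `interpolate_sketches`, and `steps_between` are hypothetical and are not taken from the project's repository.

```python
from typing import List

# Assumption: a frame is a grayscale image stored as rows of intensities in [0, 1].
Frame = List[List[float]]


def blend(a: Frame, b: Frame, t: float) -> Frame:
    """Linear cross-fade between two same-sized frames (t=0 gives a, t=1 gives b)."""
    return [
        [(1.0 - t) * pa + t * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def interpolate_sketches(sketches: List[Frame], steps_between: int) -> List[Frame]:
    """Expand keyframe sketches into a denser sequence of control frames.

    Inserts `steps_between` blended frames between each consecutive pair of
    sketches and keeps every original sketch in place.
    """
    if not sketches:
        return []
    frames: List[Frame] = []
    for a, b in zip(sketches, sketches[1:]):
        frames.append(a)
        for k in range(1, steps_between + 1):
            # Evenly spaced blend weights strictly between 0 and 1.
            frames.append(blend(a, b, k / (steps_between + 1)))
    frames.append(sketches[-1])
    return frames


if __name__ == "__main__":
    # Two 1x1 "sketches": a black pixel and a white pixel.
    dense = interpolate_sketches([[[0.0]], [[1.0]]], steps_between=3)
    print(len(dense), [frame[0][0] for frame in dense])
```

In the pipeline the abstract describes, a dense sequence like this would serve as the per-frame control input to the ControlNet-conditioned Text-to-Video Zero stage. That stage is omitted here because it depends on the project's own code.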
