VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

Abstract

Although recent text-to-video (T2V) generation methods have advanced significantly, most of this work focuses on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated the ability to generate layouts and programs that control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities and backgrounds across scenes, despite being trained only with image-level annotations. Our experiments demonstrate that the VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving performance competitive with the state of the art (SOTA) in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength of layout guidance and can generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
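To make the notion of a 'video plan' more concrete, the following is a minimal Python sketch of one way its four components (scene descriptions, entities with layouts, per-scene backgrounds, and consistency groupings) might be represented. The abstract does not specify a schema, so every class name, field name, and the bounding-box format here is an assumption made for illustration, and the example prompt and scenes are invented. It is not the authors' implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# NOTE: all names below are hypothetical; the abstract defines no schema.
Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class EntityLayout:
    name: str
    boxes: List[Box]  # assumed: one bounding box per keyframe


@dataclass
class Scene:
    description: str
    background: str
    entities: List[EntityLayout] = field(default_factory=list)


@dataclass
class VideoPlan:
    prompt: str
    scenes: List[Scene] = field(default_factory=list)

    def consistency_groups(self) -> Tuple[Dict[str, List[int]], Dict[str, List[int]]]:
        """Group scene indices by shared entity name and by shared background."""
        entities: Dict[str, List[int]] = defaultdict(list)
        backgrounds: Dict[str, List[int]] = defaultdict(list)
        for idx, scene in enumerate(self.scenes):
            backgrounds[scene.background].append(idx)
            # Use a set so an entity listed twice in one scene is counted once.
            for name in sorted({e.name for e in scene.entities}):
                entities[name].append(idx)
        return dict(entities), dict(backgrounds)


# Invented example plan for a single text prompt.
plan = VideoPlan(
    prompt="a chef bakes a cake",
    scenes=[
        Scene("a chef mixes batter", "kitchen",
              [EntityLayout("chef", [(0.1, 0.2, 0.5, 0.9)]),
               EntityLayout("bowl", [(0.5, 0.6, 0.8, 0.9)])]),
        Scene("the chef puts a pan in the oven", "kitchen",
              [EntityLayout("chef", [(0.2, 0.2, 0.6, 0.9)]),
               EntityLayout("oven", [(0.6, 0.3, 0.95, 0.9)])]),
        Scene("the chef serves the cake", "dining room",
              [EntityLayout("chef", [(0.1, 0.1, 0.5, 0.9)]),
               EntityLayout("cake", [(0.55, 0.6, 0.8, 0.8)])]),
    ],
)

entity_groups, background_groups = plan.consistency_groups()
```

In this sketch, an entity or background that appears in several scenes ends up in one group, which is the kind of information a downstream generator would need in order to render it consistently across those scenes.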
