VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM

Abstract

Recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos from given prompts. Most existing works tackle the single-scene scenario, in which only one video event occurs in a single background. Extending generation to multi-scene videos is nevertheless not trivial: it requires managing the logic between scenes while preserving a consistent visual appearance of key content across them. In this paper, we propose a novel framework, VideoDrafter, for content-consistent multi-scene video generation. Technically, VideoDrafter leverages Large Language Models (LLMs) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learned by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoDrafter identifies the common entities throughout the script and asks the LLM to detail each entity. The resulting entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoDrafter outputs a multi-scene video by generating each scene video via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as the condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoDrafter outperforms state-of-the-art (SOTA) video generation models in terms of visual quality, content consistency, and user preference.
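To make the script structure concrete, the following is a minimal sketch of a per-scene record and of the step that finds entities shared across scenes. It is illustrative only: the class name `Scene`, its field names, the helper `common_entities`, and the `min_scenes` threshold are all assumptions, not the authors' implementation. The abstract does not say how common entities are identified, so simple counting is used here as a stand-in.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Scene:
    """One scene of a multi-scene script (hypothetical field names)."""
    event_prompt: str       # prompt describing the event
    foreground: list        # names of foreground entities
    background: str         # name of the background entity
    camera_movement: str    # e.g. "static", "zoom in"


def common_entities(scenes, min_scenes=2):
    """Return entities that occur in at least `min_scenes` scenes, in first-seen order.

    Counting is an assumed stand-in for the identification step.
    """
    counts = Counter()
    for scene in scenes:
        # Count each entity at most once per scene.
        for name in set(scene.foreground) | {scene.background}:
            counts[name] += 1

    ordered = []
    for scene in scenes:
        for name in [*scene.foreground, scene.background]:
            if counts[name] >= min_scenes and name not in ordered:
                ordered.append(name)
    return ordered


# Made-up example script with three scenes.
scenes = [
    Scene("A young man makes coffee", ["young man"], "kitchen", "static"),
    Scene("The young man reads a newspaper", ["young man"], "living room", "zoom in"),
    Scene("The young man washes a cup", ["young man", "cup"], "kitchen", "pan left"),
]

print(common_entities(scenes))
```

In a pipeline like the one described, each returned entity would then be given a detailed description and a reference image, so that every scene containing it is conditioned on the same appearance.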
