Compositional 3D-aware Video Generation with LLM Director

Abstract

Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual concepts within the generated video, such as the motion and appearance of specific characters and the movement of viewpoints. In this work, we propose a novel paradigm that generates each concept in a 3D representation separately and then composes them with priors from Large Language Models (LLMs) and 2D diffusion models. Specifically, given an input textual prompt, our scheme consists of three stages: 1) We use an LLM as the director to first decompose the complex query into several sub-prompts that indicate individual concepts within the video (e.g., scene, objects, motions), and then let the LLM invoke pre-trained expert models to obtain the corresponding 3D representations of those concepts. 2) To compose these representations, we prompt a multi-modal LLM to produce coarse guidance on the scales and coordinates of the objects' trajectories. 3) To make the generated frames adhere to the natural image distribution, we further leverage 2D diffusion priors and use Score Distillation Sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept. Project page: https://aka.ms/c3v.
