Compositional 3D-aware Video Generation with LLM Director

August 31, 2024
Authors: Hanxin Zhu, Tianyu He, Anni Tang, Junliang Guo, Zhibo Chen, Jiang Bian
cs.AI

Abstract

Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual concepts within the generated video, such as the motion and appearance of specific characters and the movement of viewpoints. In this work, we propose a novel paradigm that generates each concept in 3D representation separately and then composes them with priors from Large Language Models (LLMs) and 2D diffusion models. Specifically, given an input textual prompt, our scheme consists of three stages: 1) We leverage an LLM as the director to first decompose the complex query into several sub-prompts that indicate individual concepts within the video (e.g., scene, objects, motions); we then let the LLM invoke pre-trained expert models to obtain the corresponding 3D representations of the concepts. 2) To compose these representations, we prompt a multi-modal LLM to produce coarse guidance on the scales and coordinates of the objects' trajectories. 3) To make the generated frames adhere to the natural image distribution, we further leverage 2D diffusion priors and use Score Distillation Sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept. Project page: https://aka.ms/c3v.
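
To make the three stages concrete, below is a minimal orchestration sketch in Python. Every name in it (the Concept and Layout containers, the stage functions, and the placeholder LLM and expert-model calls) is a hypothetical illustration of the pipeline described in the abstract, not the authors' actual implementation or API.

# Minimal orchestration sketch of the three-stage pipeline described above.
# All names here (Concept, Layout, the stage functions, the placeholder
# expert-model and LLM calls) are hypothetical illustrations, not the
# authors' actual implementation or API.

from dataclasses import dataclass


@dataclass
class Concept:
    sub_prompt: str                 # sub-prompt produced by the LLM director
    kind: str                       # e.g. "scene", "object", "motion"
    representation: object = None   # 3D representation returned by an expert model


@dataclass
class Layout:
    scale: float                    # coarse scale suggested for an object
    trajectory: list                # coarse (x, y, z) waypoints over time


def stage1_decompose_and_generate(text_prompt: str) -> list[Concept]:
    # Stage 1: the LLM director decomposes the complex query into sub-prompts,
    # one per concept, then invokes pre-trained expert models to obtain a 3D
    # representation for each concept. Dummy outputs stand in for both calls.
    concepts = [
        Concept("a knight riding a horse", "object"),
        Concept("a medieval castle courtyard", "scene"),
    ]
    for c in concepts:
        c.representation = f"<3D asset for '{c.sub_prompt}'>"  # expert-model call (placeholder)
    return concepts


def stage2_coarse_layout(concepts: list[Concept]) -> dict[str, Layout]:
    # Stage 2: a multi-modal LLM proposes coarse scales and trajectory
    # coordinates so the per-concept representations can be composed.
    return {
        c.sub_prompt: Layout(scale=1.0, trajectory=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)])
        for c in concepts
        if c.kind == "object"
    }


def stage3_refine_with_sds(concepts, layout, num_steps: int = 100):
    # Stage 3: render the composition and refine it with Score Distillation
    # Sampling so the rendered frames adhere to the natural-image prior of a
    # 2D diffusion model. The optimization loop is left as a stub.
    for _ in range(num_steps):
        pass  # render frames, compute the SDS gradient, update composition parameters
    return concepts, layout


if __name__ == "__main__":
    prompt = "a knight rides a horse through a castle courtyard"
    concepts = stage1_decompose_and_generate(prompt)
    layout = stage2_coarse_layout(concepts)
    refined = stage3_refine_with_sds(concepts, layout)

Stage 3 presumably follows the standard Score Distillation Sampling formulation: for a rendered frame x = g(θ), the gradient ∇_θ L_SDS = E_{t,ε}[ w(t) (ε̂_φ(x_t; y, t) − ε) ∂x/∂θ ] from the frozen 2D diffusion model ε̂_φ, conditioned on the text y, is backpropagated to the parameters θ of the composition rather than to a single 3D asset.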