VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning

September 26, 2023
Authors: Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
cs.AI

Abstract

Although recent text-to-video (T2V) generation methods have seen significant advancements, most of these works focus on producing short video clips of a single event with a single background (i.e., single-scene videos). Meanwhile, recent large language models (LLMs) have demonstrated their capability in generating layouts and programs to control downstream visual modules such as image generation models. This raises an important question: can we leverage the knowledge embedded in these LLMs for temporally consistent long video generation? In this paper, we propose VideoDirectorGPT, a novel framework for consistent multi-scene video generation that uses the knowledge of LLMs for video content planning and grounded video generation. Specifically, given a single text prompt, we first ask our video planner LLM (GPT-4) to expand it into a 'video plan', which involves generating the scene descriptions, the entities with their respective layouts, the background for each scene, and consistency groupings of the entities and backgrounds. Next, guided by this output from the video planner, our video generator, Layout2Vid, has explicit control over spatial layouts and can maintain temporal consistency of entities/backgrounds across scenes, while being trained only with image-level annotations. Our experiments demonstrate that the VideoDirectorGPT framework substantially improves layout and movement control in both single- and multi-scene video generation and can generate multi-scene videos with visual consistency across scenes, while achieving performance competitive with SOTAs in open-domain single-scene T2V generation. We also demonstrate that our framework can dynamically control the strength of layout guidance and can generate videos with user-provided images. We hope our framework can inspire future work on better integrating the planning ability of LLMs into consistent long video generation.
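
The abstract describes a two-stage pipeline: a planner LLM expands the text prompt into a "video plan" (scene descriptions, entities with their layouts, a background per scene, and consistency groupings), which Layout2Vid then consumes for layout-grounded, cross-scene-consistent generation. The sketch below is a minimal, hypothetical illustration of what such a plan object might look like in Python; all class and field names (EntityLayout, Scene, VideoPlan, consistency_groups, the bounding-box format) are assumptions for illustration and not the paper's actual schema.

```python
# Hypothetical sketch of the "video plan" structure described in the abstract.
# Names and fields are illustrative assumptions, not the paper's actual schema.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EntityLayout:
    """One entity in a scene, with per-frame bounding boxes (x1, y1, x2, y2)."""
    name: str  # e.g. "a brown dog"
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)


@dataclass
class Scene:
    """A single scene: its text description, background, and entity layouts."""
    description: str   # scene-level text prompt
    background: str    # e.g. "a sunny park"
    entities: List[EntityLayout] = field(default_factory=list)


@dataclass
class VideoPlan:
    """Planner output: the scenes plus consistency groupings that tell the
    video generator which entities/backgrounds should stay visually identical
    across scenes."""
    scenes: List[Scene]
    # Maps a shared identity (e.g. "dog_1") to the scene indices where it appears.
    consistency_groups: Dict[str, List[int]] = field(default_factory=dict)
```

Under this reading, the planner LLM would emit one VideoPlan per prompt, and the consistency groupings are what allow the generator to reuse the same entity/background representation across the scenes they span.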