VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
September 26, 2023
Authors: Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
cs.AI
Abstract
Although recent text-to-video (T2V) generation methods have seen significant
advancements, most of these works focus on producing short video clips of a
single event with a single background (i.e., single-scene videos). Meanwhile,
recent large language models (LLMs) have demonstrated their capability in
generating layouts and programs to control downstream visual modules such as
image generation models. This raises an important question: can we leverage the
knowledge embedded in these LLMs for temporally consistent long video
generation? In this paper, we propose VideoDirectorGPT, a novel framework for
consistent multi-scene video generation that uses the knowledge of LLMs for
video content planning and grounded video generation. Specifically, given a
single text prompt, we first ask our video planner LLM (GPT-4) to expand it
into a 'video plan', which involves generating the scene descriptions, the
entities with their respective layouts, the background for each scene, and
consistency groupings of the entities and backgrounds. Next, guided by this
output from the video planner, our video generator, Layout2Vid, has explicit
control over spatial layouts and can maintain temporal consistency of
entities/backgrounds across scenes, while being trained only with image-level
annotations. Our experiments demonstrate that the VideoDirectorGPT framework
substantially improves layout and movement control in both single- and
multi-scene video generation and can generate multi-scene videos with visual
consistency across scenes, while achieving competitive performance with SOTAs
in open-domain single-scene T2V generation. We also demonstrate that our
framework can dynamically control the strength of layout guidance and can also
generate videos with user-provided images. We hope our framework can inspire
future work on better integrating the planning ability of LLMs into consistent
long video generation.
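
Below is a minimal, illustrative sketch of the two-stage pipeline the abstract describes: a planner produces a structured "video plan" (scene descriptions, entity layouts, backgrounds, consistency groupings), and a layout-conditioned generator consumes it. The names (`VideoPlan`, `SceneSpec`, `EntityLayout`, `plan_video`, `generate_video`) and the plan contents are hypothetical stand-ins, not the paper's actual data format or API; the planner and Layout2Vid calls are stubbed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# NOTE: all class/function names here are illustrative assumptions, not the
# paper's real interfaces. The LLM planner and Layout2Vid are stubbed out.

@dataclass
class EntityLayout:
    name: str                                   # e.g., "chef"
    box: Tuple[float, float, float, float]      # normalized (x0, y0, x1, y1)

@dataclass
class SceneSpec:
    description: str                            # scene-level text description
    background: str                             # background prompt for the scene
    entities: List[EntityLayout]                # entities with bounding-box layouts

@dataclass
class VideoPlan:
    scenes: List[SceneSpec]
    # Consistency groupings: entities/backgrounds sharing a group key should
    # keep the same appearance across scenes.
    entity_groups: Dict[str, List[str]] = field(default_factory=dict)


def plan_video(prompt: str) -> VideoPlan:
    """Stage 1 (video planner): an LLM such as GPT-4 would expand the prompt
    into a video plan. Stubbed here with a fixed two-scene example."""
    return VideoPlan(
        scenes=[
            SceneSpec("A chef kneads dough in a kitchen", "sunlit kitchen",
                      [EntityLayout("chef", (0.3, 0.2, 0.7, 0.9))]),
            SceneSpec("The chef slides a pizza into the oven", "sunlit kitchen",
                      [EntityLayout("chef", (0.1, 0.2, 0.5, 0.9)),
                       EntityLayout("pizza", (0.5, 0.5, 0.8, 0.8))]),
        ],
        entity_groups={"chef": ["scene1.chef", "scene2.chef"],
                       "kitchen": ["scene1.bg", "scene2.bg"]},
    )


def generate_video(plan: VideoPlan) -> List[str]:
    """Stage 2 (video generator): a Layout2Vid-style module would condition
    each scene on its spatial layout and reuse shared entity representations
    for members of the same consistency group. Stubbed to return labels."""
    return [f"rendered scene: {scene.description}" for scene in plan.scenes]


if __name__ == "__main__":
    plan = plan_video("A chef makes a pizza from scratch.")
    for rendered in generate_video(plan):
        print(rendered)
```

In this sketch, cross-scene consistency is expressed purely as the `entity_groups` mapping; a real generator would use such a grouping to share identity conditioning (e.g., entity embeddings) across scenes rather than re-sampling each entity's appearance per scene.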