
VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM

January 2, 2024
Authors: Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei
cs.AI

Abstract

Recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for given prompts. Most existing works tackle the single-scene scenario, with only one video event occurring in a single background. Extending to multi-scene video generation, however, is not trivial: it requires carefully managing the logic between scenes while preserving a consistent visual appearance of key content across them. In this paper, we propose a novel framework, namely VideoDrafter, for content-consistent multi-scene video generation. Technically, VideoDrafter leverages a Large Language Model (LLM) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learnt by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, as well as the camera movement. VideoDrafter identifies the common entities throughout the script and asks the LLM to detail each entity. The resultant entity description is then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoDrafter outputs a multi-scene video by generating each scene's video via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as condition and alignment to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoDrafter outperforms state-of-the-art video generation models in terms of visual quality, content consistency, and user preference.
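The four-stage pipeline described above (script writing, entity description, reference-image generation, and per-scene conditioned diffusion) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names and data structures are assumptions, and the LLM, text-to-image, and diffusion calls are stubbed with placeholders.

```python
# Hypothetical sketch of the VideoDrafter pipeline from the abstract.
# Every name here is illustrative; model calls are stubbed.

def llm_write_script(prompt):
    """Stage 1: an LLM expands the input prompt into a multi-scene script.
    Each scene lists an event prompt, its entities, and a camera movement."""
    return [
        {"event": f"{prompt} - scene {i}",
         "entities": ["cat", "garden"],
         "camera": "pan-left"}
        for i in range(1, 3)
    ]

def llm_describe_entity(entity):
    """Stage 2: the LLM details each common entity found across scenes."""
    return f"a detailed description of {entity}"

def text_to_image(description):
    """Stage 3: a text-to-image model turns each entity description into a
    reference image (represented by a tagged string in this stub)."""
    return f"reference_image({description})"

def scene_diffusion(event, camera, reference_images):
    """Stage 4: a video diffusion model generates one scene, conditioned on
    the reference images so entity appearance stays consistent."""
    return {"event": event, "camera": camera, "refs": reference_images}

def video_drafter(prompt):
    script = llm_write_script(prompt)
    # Common entities are shared across scenes, so each gets ONE reference
    # image that is reused everywhere -- this is the source of the
    # cross-scene content consistency.
    common = sorted({e for scene in script for e in scene["entities"]})
    refs = {e: text_to_image(llm_describe_entity(e)) for e in common}
    return [scene_diffusion(s["event"], s["camera"],
                            [refs[e] for e in s["entities"]])
            for s in script]

video = video_drafter("a cat exploring a garden")
```

The key design choice the abstract highlights is in `video_drafter`: reference images are generated once per entity and then shared across every scene's diffusion process, rather than regenerated per scene, which is what enforces a consistent look for recurring foreground and background content.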