VideoDrafter: Content-Consistent Multi-Scene Video Generation with LLM
January 2, 2024
Authors: Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei
cs.AI
Abstract
The recent innovations and breakthroughs in diffusion models have
significantly expanded the possibilities of generating high-quality videos for
given prompts. Most existing works tackle the single-scene scenario, in which
only one video event occurs in a single background. Extending to multi-scene
video generation is nevertheless not trivial: it necessitates managing the
logic between scenes while preserving a consistent visual appearance of key
content across them. In this paper, we propose a novel framework, namely
VideoDrafter, for content-consistent multi-scene video generation.
Technically, VideoDrafter leverages a Large Language Model (LLM) to convert
the input prompt into a comprehensive multi-scene script that benefits from
the logical knowledge learnt by the LLM. The script for each scene includes a
prompt describing the event, the foreground/background entities, and the
camera movement. VideoDrafter identifies the entities common across the
script and asks the LLM to detail each one. The resultant entity descriptions
are then fed into a text-to-image model to generate a reference image for
each entity. Finally, VideoDrafter outputs a multi-scene video by generating
each scene via a diffusion process that takes the reference images, the
descriptive prompt of the event, and the camera movement into account. The
diffusion model incorporates the reference images as condition and alignment
signals to strengthen the content consistency of multi-scene videos.
Extensive experiments demonstrate that VideoDrafter outperforms SOTA video
generation models in terms of visual quality, content consistency, and user
preference.
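The staged pipeline the abstract describes (prompt → multi-scene script → per-entity reference images → per-scene diffusion) can be sketched as plain control flow. This is only an illustrative sketch: every component below (`llm_write_script`, `llm_describe_entity`, `text_to_image`, `generate_scene_video`) is a hypothetical stub standing in for the paper's actual LLM, text-to-image, and video diffusion models, not the authors' implementation.

```python
# Hypothetical sketch of the VideoDrafter-style pipeline from the abstract.
# All model calls are stand-in stubs; only the data flow mirrors the text.
from dataclasses import dataclass


@dataclass
class SceneScript:
    prompt: str        # event description for this scene
    entities: list     # foreground/background entity names
    camera: str        # camera movement, e.g. "pan right"


def llm_write_script(user_prompt):
    """Stub for the LLM call that expands one prompt into a multi-scene script."""
    return [
        SceneScript("a dog runs in a park", ["dog", "park"], "pan right"),
        SceneScript("the dog drinks from a fountain", ["dog", "fountain"], "static"),
    ]


def llm_describe_entity(name):
    """Stub for the LLM call that details a single entity."""
    return f"a detailed description of {name}"


def text_to_image(description):
    """Stub for the text-to-image model producing one reference image."""
    return {"reference_image_for": description}


def generate_scene_video(prompt, reference_images, camera):
    """Stub for the scene-level diffusion process conditioned on references."""
    return {"prompt": prompt, "refs": reference_images, "camera": camera}


def video_drafter(user_prompt):
    script = llm_write_script(user_prompt)
    # Entities shared across scenes get exactly ONE reference image each;
    # reusing that image in every scene is what enforces cross-scene
    # appearance consistency.
    common = {e for scene in script for e in scene.entities}
    references = {e: text_to_image(llm_describe_entity(e)) for e in common}
    return [
        generate_scene_video(s.prompt, [references[e] for e in s.entities], s.camera)
        for s in script
    ]


scenes = video_drafter("a dog's day at the park")
```

Because the "dog" entity appears in both scenes, both generated scene records point to the same reference image object, which is the mechanism the abstract credits for content consistency.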