VideoPoet: A Large Language Model for Zero-Shot Video Generation
December 21, 2023
Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang
cs.AI
Abstract
We present VideoPoet, a language model capable of synthesizing high-quality
video, with matching audio, from a large variety of conditioning signals.
VideoPoet employs a decoder-only transformer architecture that processes
multimodal inputs -- including images, videos, text, and audio. The training
protocol follows that of Large Language Models (LLMs), consisting of two
stages: pretraining and task-specific adaptation. During pretraining, VideoPoet
incorporates a mixture of multimodal generative objectives within an
autoregressive Transformer framework. The pretrained LLM serves as a foundation
that can be adapted for a range of video generation tasks. We present empirical
results demonstrating the model's state-of-the-art capabilities in zero-shot
video generation, specifically highlighting VideoPoet's ability to generate
high-fidelity motions. Project page: http://sites.research.google/videopoet/
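To make the abstract's core idea concrete: a decoder-only LLM can generate video by treating images, video, text, and audio as discrete tokens drawn from one shared vocabulary, then predicting tokens autoregressively. The sketch below is purely illustrative and not the authors' implementation; the token-ID ranges, the toy stand-in for the transformer, and the `generate` helper are all hypothetical.

```python
import numpy as np

# Hypothetical token-ID ranges: one shared vocabulary partitioned by modality.
TEXT_RANGE = range(0, 1000)       # assumed text-token IDs
VIDEO_RANGE = range(1000, 9000)   # assumed video-token IDs (e.g., from a video tokenizer)
AUDIO_RANGE = range(9000, 10000)  # assumed audio-token IDs
VOCAB_SIZE = 10000

def toy_decoder_logits(context, rng):
    """Stand-in for a decoder-only transformer: returns next-token logits.
    A real model would attend over the full multimodal context."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, n_new, allowed, seed=0):
    """Autoregressively sample n_new tokens, masked to the target modality's range."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = toy_decoder_logits(tokens, rng)
        mask = np.full(VOCAB_SIZE, -np.inf)
        mask[allowed.start:allowed.stop] = 0.0  # restrict sampling to one modality
        shifted = logits + mask - np.max(logits + mask)
        probs = np.exp(shifted) / np.exp(shifted).sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

# Text-to-video in this toy setup: condition on text tokens, decode video tokens.
out = generate(prompt_tokens=[1, 2, 3], n_new=8, allowed=VIDEO_RANGE)
print(len(out), all(t in VIDEO_RANGE for t in out[3:]))
```

Because every task (text-to-video, image-to-video, audio continuation) reduces to "predict the next token given a multimodal prefix," one pretrained model can be adapted to many generation tasks, which is the training recipe the abstract describes.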