VideoPoet: A Large Language Model for Zero-Shot Video Generation
December 21, 2023
Authors: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang
cs.AI
Abstract
We present VideoPoet, a language model capable of synthesizing high-quality
video, with matching audio, from a large variety of conditioning signals.
VideoPoet employs a decoder-only transformer architecture that processes
multimodal inputs -- including images, videos, text, and audio. The training
protocol follows that of Large Language Models (LLMs), consisting of two
stages: pretraining and task-specific adaptation. During pretraining, VideoPoet
incorporates a mixture of multimodal generative objectives within an
autoregressive Transformer framework. The pretrained LLM serves as a foundation
that can be adapted for a range of video generation tasks. We present empirical
results demonstrating the model's state-of-the-art capabilities in zero-shot
video generation, specifically highlighting VideoPoet's ability to generate
high-fidelity motions. Project page: http://sites.research.google/videopoet/
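To make the abstract's core idea concrete: a decoder-only LLM can generate video by treating images, video, text, and audio as discrete tokens drawn from one shared vocabulary, then predicting tokens autoregressively. The sketch below is purely illustrative and not the authors' implementation; the token-ID ranges, the toy stand-in for the transformer, and the `generate` helper are all hypothetical.

```python
import numpy as np

# Hypothetical token-ID ranges: one shared vocabulary partitioned by modality.
TEXT_RANGE = range(0, 1000)       # assumed text-token IDs
VIDEO_RANGE = range(1000, 9000)   # assumed video-token IDs (e.g., from a video tokenizer)
AUDIO_RANGE = range(9000, 10000)  # assumed audio-token IDs
VOCAB_SIZE = 10000

def toy_decoder_logits(context, rng):
    """Stand-in for a decoder-only transformer: returns next-token logits.
    A real model would attend over the full multimodal context."""
    return rng.normal(size=VOCAB_SIZE)

def generate(prompt_tokens, n_new, allowed, seed=0):
    """Autoregressively sample n_new tokens, masked to the target modality's range."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = toy_decoder_logits(tokens, rng)
        mask = np.full(VOCAB_SIZE, -np.inf)
        mask[allowed.start:allowed.stop] = 0.0  # restrict sampling to one modality
        shifted = logits + mask - np.max(logits + mask)
        probs = np.exp(shifted) / np.exp(shifted).sum()
        tokens.append(int(rng.choice(VOCAB_SIZE, p=probs)))
    return tokens

# Text-to-video in this toy setup: condition on text tokens, decode video tokens.
out = generate(prompt_tokens=[1, 2, 3], n_new=8, allowed=VIDEO_RANGE)
print(len(out), all(t in VIDEO_RANGE for t in out[3:]))
```

Because every task (text-to-video, image-to-video, audio continuation) reduces to "predict the next token given a multimodal prefix," one pretrained model can be adapted to many generation tasks, which is the training recipe the abstract describes.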