VideoPoet: A Large Language Model for Zero-Shot Video Generation
December 21, 2023
作者: Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, Yong Cheng, Ming-Chang Chiu, Josh Dillon, Irfan Essa, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, David Ross, Grant Schindler, Mikhail Sirotenko, Kihyuk Sohn, Krishna Somandepalli, Huisheng Wang, Jimmy Yan, Ming-Hsuan Yang, Xuan Yang, Bryan Seybold, Lu Jiang
cs.AI
Abstract
We present VideoPoet, a language model capable of synthesizing high-quality
video, with matching audio, from a large variety of conditioning signals.
VideoPoet employs a decoder-only transformer architecture that processes
multimodal inputs -- including images, videos, text, and audio. The training
protocol follows that of Large Language Models (LLMs), consisting of two
stages: pretraining and task-specific adaptation. During pretraining, VideoPoet
incorporates a mixture of multimodal generative objectives within an
autoregressive Transformer framework. The pretrained LLM serves as a foundation
that can be adapted for a range of video generation tasks. We present empirical
results demonstrating the model's state-of-the-art capabilities in zero-shot
video generation, specifically highlighting VideoPoet's ability to generate
high-fidelity motions. Project page: http://sites.research.google/videopoet/
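The abstract describes a decoder-only language model that treats images, video, text, and audio uniformly as discrete tokens and generates them autoregressively. The toy sketch below illustrates that general idea only: it is not VideoPoet's implementation, and the vocabulary sizes, offsets, and function names are all invented for illustration (a real system would use learned tokenizers and a trained transformer rather than the dummy predictor here).

```python
# Hypothetical illustration of a shared multimodal token space and
# autoregressive decoding, as used conceptually by decoder-only
# multimodal LMs. All constants below are made-up placeholders.

TEXT_VOCAB, VISUAL_VOCAB, AUDIO_VOCAB = 1000, 8192, 4096

# Disjoint id ranges so every modality lives in one shared vocabulary.
TEXT_OFFSET = 0
VISUAL_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = VISUAL_OFFSET + VISUAL_VOCAB

def to_shared_ids(modality: str, local_ids: list[int]) -> list[int]:
    """Map modality-local token ids into the shared vocabulary."""
    offset = {"text": TEXT_OFFSET,
              "visual": VISUAL_OFFSET,
              "audio": AUDIO_OFFSET}[modality]
    return [offset + i for i in local_ids]

def greedy_decode(next_token_fn, prefix: list[int], num_steps: int) -> list[int]:
    """Autoregressive generation: repeatedly append the predicted next token.

    `next_token_fn` stands in for a trained decoder-only transformer;
    here any callable mapping a token sequence to the next token id works.
    """
    seq = list(prefix)
    for _ in range(num_steps):
        seq.append(next_token_fn(seq))
    return seq

# Example: condition on text tokens, then "generate" visual tokens with a
# dummy predictor that just emits consecutive ids in the visual range.
prompt = to_shared_ids("text", [5, 17, 42])
dummy_model = lambda seq: max(seq[-1] + 1, VISUAL_OFFSET)
out = greedy_decode(dummy_model, prompt, num_steps=4)
```

Conditioning on a prefix from one modality and decoding tokens in another is what lets a single pretrained model be adapted to many generation tasks (text-to-video, video-to-audio, etc.) without task-specific architectures.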