VideoPoet: A Large Language Model for Zero-Shot Video Generation

Abstract

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs) and consists of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
