VideoPoet: A Large Language Model for Zero-Shot Video Generation

Abstract

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs, including images, videos, text, and audio. The training protocol follows that of Large Language Models (LLMs), consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/
