Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
February 5, 2024
Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
cs.AI
Abstract
In light of recent advances in multimodal Large Language Models (LLMs), there
is increasing attention to scaling them from image-text data to more
informative real-world videos. Compared to static images, video poses unique
challenges for effective large-scale pre-training due to the modeling of its
spatiotemporal dynamics. In this paper, we address such limitations in
video-language pre-training with an efficient video decomposition that
represents each video as keyframes and temporal motions. These are then adapted
to an LLM using well-designed tokenizers that discretize visual and temporal
information as a few tokens, thus enabling unified generative pre-training of
videos, images, and text. At inference, the generated tokens from the LLM are
carefully recovered to the original continuous pixel space to create various
video content. Our proposed framework is capable of both comprehending and
generating image and video content, as demonstrated by its competitive
performance across 13 multimodal benchmarks in image and video understanding
and generation. Our code and models will be available at
https://video-lavit.github.io.
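
Below is a minimal, illustrative sketch of the decoupled tokenization idea described in the abstract, assuming a fixed keyframe-sampling interval, frame-difference motion as a cheap stand-in for the MPEG-style motion vectors the paper relies on, and random vector-quantization codebooks. All names here (decompose_video, VectorQuantizer, tokenize_video, BOV/BOM/EOV) are hypothetical and are not taken from the released code.

# Illustrative sketch (not the authors' implementation): decompose a video into
# keyframes and motion, then discretize both into one unified token sequence.
import numpy as np

KEYFRAME_INTERVAL = 8        # one keyframe per 8 frames (assumption)
BOV, BOM, EOV = 0, 1, 2      # hypothetical special tokens marking segment boundaries


def decompose_video(frames: np.ndarray):
    """Split frames (T, H, W, C) into keyframes and motion residuals.

    Motion is approximated by frame differences; the paper instead uses
    MPEG-style motion vectors, which this sketch does not reproduce.
    """
    keyframes = frames[::KEYFRAME_INTERVAL]
    motions = np.diff(frames.astype(np.float32), axis=0)
    return keyframes, motions


class VectorQuantizer:
    """Nearest-neighbour quantization against a fixed random codebook."""

    def __init__(self, codebook_size: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.normal(size=(codebook_size, dim))

    def encode(self, features: np.ndarray) -> np.ndarray:
        # features: (N, dim) -> (N,) integer token ids
        dists = ((features[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)


def tokenize_video(frames: np.ndarray, visual_vq: VectorQuantizer,
                   motion_vq: VectorQuantizer) -> list[int]:
    """Return one sequence: [BOV, visual tokens..., BOM, motion tokens..., EOV]."""
    keyframes, motions = decompose_video(frames)
    # Crudely pool each keyframe / motion clip into a small feature vector.
    vis_feats = keyframes.reshape(len(keyframes), -1)[:, :visual_vq.codebook.shape[1]]
    mot_feats = motions.reshape(len(motions), -1)[:, :motion_vq.codebook.shape[1]]
    visual_tokens = visual_vq.encode(vis_feats.astype(np.float32))
    motion_tokens = motion_vq.encode(mot_feats.astype(np.float32))
    # Offset the two vocabularies so visual, motion, and special ids never collide.
    num_special = 3
    visual_offset = num_special
    motion_offset = num_special + visual_vq.codebook.shape[0]
    return [BOV, *(visual_tokens + visual_offset).tolist(),
            BOM, *(motion_tokens + motion_offset).tolist(), EOV]


if __name__ == "__main__":
    video = np.random.rand(32, 16, 16, 3)            # toy 32-frame clip
    tokens = tokenize_video(video,
                            VectorQuantizer(1024, dim=64),
                            VectorQuantizer(256, dim=64))
    print(len(tokens), tokens[:10])

The point of the decomposition is economy: a few keyframes carry most of the appearance information, while the remaining frames are summarized by compact motion tokens, so the whole clip fits into a short discrete sequence that an LLM can model alongside image and text tokens.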