Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
February 5, 2024
Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
cs.AI
Abstract
In light of recent advances in multimodal Large Language Models (LLMs), there
is increasing attention to scaling them from image-text data to more
informative real-world videos. Compared to static images, video poses unique
challenges for effective large-scale pre-training due to the modeling of its
spatiotemporal dynamics. In this paper, we address such limitations in
video-language pre-training with an efficient video decomposition that
represents each video as keyframes and temporal motions. These are then adapted
to an LLM using well-designed tokenizers that discretize visual and temporal
information as a few tokens, thus enabling unified generative pre-training of
videos, images, and text. At inference, the generated tokens from the LLM are
carefully recovered to the original continuous pixel space to create various
video content. Our proposed framework is capable of both comprehending and
generating image and video content, as demonstrated by its competitive
performance across 13 multimodal benchmarks in image and video understanding
and generation. Our code and models will be available at
https://video-lavit.github.io.
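
Below is a minimal, illustrative sketch of the decoupled tokenization idea described in the abstract, assuming a fixed keyframe-sampling interval, frame-difference motion as a cheap stand-in for the MPEG-style motion vectors the paper relies on, and random vector-quantization codebooks. All names here (decompose_video, VectorQuantizer, tokenize_video, BOV/BOM/EOV) are hypothetical and are not taken from the released code.

# Illustrative sketch (not the authors' implementation): decompose a video into
# keyframes and motion, then discretize both into one unified token sequence.
import numpy as np

KEYFRAME_INTERVAL = 8        # one keyframe per 8 frames (assumption)
BOV, BOM, EOV = 0, 1, 2      # hypothetical special tokens marking segment boundaries


def decompose_video(frames: np.ndarray):
    """Split frames (T, H, W, C) into keyframes and motion residuals.

    Motion is approximated by frame differences; the paper instead uses
    MPEG-style motion vectors, which this sketch does not reproduce.
    """
    keyframes = frames[::KEYFRAME_INTERVAL]
    motions = np.diff(frames.astype(np.float32), axis=0)
    return keyframes, motions


class VectorQuantizer:
    """Nearest-neighbour quantization against a fixed random codebook."""

    def __init__(self, codebook_size: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.normal(size=(codebook_size, dim))

    def encode(self, features: np.ndarray) -> np.ndarray:
        # features: (N, dim) -> (N,) integer token ids
        dists = ((features[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)


def tokenize_video(frames: np.ndarray, visual_vq: VectorQuantizer,
                   motion_vq: VectorQuantizer) -> list[int]:
    """Return one sequence: [BOV, visual tokens..., BOM, motion tokens..., EOV]."""
    keyframes, motions = decompose_video(frames)
    # Crudely pool each keyframe / motion clip into a small feature vector.
    vis_feats = keyframes.reshape(len(keyframes), -1)[:, :visual_vq.codebook.shape[1]]
    mot_feats = motions.reshape(len(motions), -1)[:, :motion_vq.codebook.shape[1]]
    visual_tokens = visual_vq.encode(vis_feats.astype(np.float32))
    motion_tokens = motion_vq.encode(mot_feats.astype(np.float32))
    # Offset the two vocabularies so visual, motion, and special ids never collide.
    num_special = 3
    visual_offset = num_special
    motion_offset = num_special + visual_vq.codebook.shape[0]
    return [BOV, *(visual_tokens + visual_offset).tolist(),
            BOM, *(motion_tokens + motion_offset).tolist(), EOV]


if __name__ == "__main__":
    video = np.random.rand(32, 16, 16, 3)            # toy 32-frame clip
    tokens = tokenize_video(video,
                            VectorQuantizer(1024, dim=64),
                            VectorQuantizer(256, dim=64))
    print(len(tokens), tokens[:10])

The point of the decomposition is economy: a few keyframes carry most of the appearance information, while the remaining frames are summarized by compact motion tokens, so the whole clip fits into a short discrete sequence that an LLM can model alongside image and text tokens.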