Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
February 5, 2024
Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
cs.AI
Abstract
In light of recent advances in multimodal Large Language Models (LLMs), there
is growing interest in scaling them from image-text data to more informative
real-world videos. Compared to static images, videos pose unique challenges
for effective large-scale pre-training because their spatiotemporal dynamics
must be modeled. In this paper, we address these limitations in video-language
pre-training with an efficient video decomposition that represents each video
as keyframes and temporal motions. These are then adapted to an LLM using
well-designed tokenizers that discretize visual and temporal information into
a small number of tokens, thus enabling unified generative pre-training of
videos, images, and text. At inference, the tokens generated by the LLM are
carefully recovered to the original continuous pixel space to create various
video content. Our proposed framework is capable of both comprehending and
generating image and video content, as demonstrated by its competitive
performance across 13 multimodal benchmarks in image and video understanding
and generation. Our code and models will be available at
https://video-lavit.github.io.
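
To make the keyframe-plus-motion decomposition concrete, below is a minimal, hypothetical sketch in Python. It is not the paper's implementation: Video-LaVIT obtains motion information from the compressed video stream and discretizes keyframes and motion with learned tokenizers, whereas this sketch samples keyframes at a fixed stride, approximates motion with dense optical flow (OpenCV's Farneback method), and quantizes motion patches against a random, untrained codebook purely to illustrate the data flow from video to a short sequence of discrete tokens. The file name `example.mp4` and all parameter values are illustrative assumptions.

```python
# Hypothetical sketch of keyframe + motion decomposition and naive motion
# "tokenization". NOT the Video-LaVIT implementation: the paper uses motion
# information from the compressed stream and learned VQ tokenizers; here we
# use optical flow and a random codebook only to illustrate the data flow.
import cv2
import numpy as np


def decompose_video(path: str, keyframe_stride: int = 16):
    """Return (keyframes, flows): RGB keyframes sampled at a fixed stride and
    the dense optical flow between consecutive keyframes (a motion stand-in)."""
    cap = cv2.VideoCapture(path)
    keyframes, flows, prev_gray = [], [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % keyframe_stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                # H x W x 2 displacement field between successive keyframes.
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                flows.append(flow)
            keyframes.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            prev_gray = gray
        idx += 1
    cap.release()
    return keyframes, flows


def tokenize_motion(flow: np.ndarray, codebook: np.ndarray, patch: int = 16):
    """Map non-overlapping flow patches to the index of their nearest codebook
    entry, yielding one discrete motion token per patch."""
    h, w, _ = flow.shape
    h, w = h - h % patch, w - w % patch
    patches = (flow[:h, :w]
               .reshape(h // patch, patch, w // patch, patch, 2)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * 2))
    # Squared distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
    d2 = ((patches ** 2).sum(1, keepdims=True)
          - 2.0 * patches @ codebook.T
          + (codebook ** 2).sum(1))
    return d2.argmin(axis=1)


if __name__ == "__main__":
    keyframes, flows = decompose_video("example.mp4")  # hypothetical file
    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((1024, 16 * 16 * 2))  # untrained stand-in
    for flow in flows:
        tokens = tokenize_motion(flow, codebook)
        print(tokens[:10])  # short discrete motion-token sequence per clip
```

In the actual framework, the analogous keyframe and motion tokens are interleaved with text tokens for unified generative pre-training, and at inference the generated tokens are decoded back to continuous pixels; the random codebook above would be replaced by the learned tokenizers described in the paper.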