Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
February 5, 2024
Authors: Yang Jin, Zhicheng Sun, Kun Xu, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, Kun Gai, Yadong Mu
cs.AI
Abstract
In light of recent advances in multimodal Large Language Models (LLMs), there
is growing interest in scaling them from image-text data to more informative
real-world videos. Compared to static images, videos pose unique challenges
for effective large-scale pre-training because their spatiotemporal dynamics
must be modeled. In this paper, we address these limitations in video-language
pre-training with an efficient video decomposition that represents each video
as keyframes and temporal motions. These are then adapted to an LLM using
well-designed tokenizers that discretize visual and temporal information into
a small number of tokens, thus enabling unified generative pre-training of
videos, images, and text. At inference, the tokens generated by the LLM are
carefully recovered to the original continuous pixel space to create various
video content. Our proposed framework is capable of both comprehending and
generating image and video content, as demonstrated by its competitive
performance across 13 multimodal benchmarks in image and video understanding
and generation. Our code and models will be available at
https://video-lavit.github.io.
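
To make the keyframe-plus-motion decomposition concrete, below is a minimal, hypothetical sketch in Python. It is not the paper's implementation: Video-LaVIT obtains motion information from the compressed video stream and discretizes keyframes and motion with learned tokenizers, whereas this sketch samples keyframes at a fixed stride, approximates motion with dense optical flow (OpenCV's Farneback method), and quantizes motion patches against a random, untrained codebook purely to illustrate the data flow from video to a short sequence of discrete tokens. The file name `example.mp4` and all parameter values are illustrative assumptions.

```python
# Hypothetical sketch of keyframe + motion decomposition and naive motion
# "tokenization". NOT the Video-LaVIT implementation: the paper uses motion
# information from the compressed stream and learned VQ tokenizers; here we
# use optical flow and a random codebook only to illustrate the data flow.
import cv2
import numpy as np


def decompose_video(path: str, keyframe_stride: int = 16):
    """Return (keyframes, flows): RGB keyframes sampled at a fixed stride and
    the dense optical flow between consecutive keyframes (a motion stand-in)."""
    cap = cv2.VideoCapture(path)
    keyframes, flows, prev_gray = [], [], None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % keyframe_stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                # H x W x 2 displacement field between successive keyframes.
                flow = cv2.calcOpticalFlowFarneback(
                    prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
                flows.append(flow)
            keyframes.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            prev_gray = gray
        idx += 1
    cap.release()
    return keyframes, flows


def tokenize_motion(flow: np.ndarray, codebook: np.ndarray, patch: int = 16):
    """Map non-overlapping flow patches to the index of their nearest codebook
    entry, yielding one discrete motion token per patch."""
    h, w, _ = flow.shape
    h, w = h - h % patch, w - w % patch
    patches = (flow[:h, :w]
               .reshape(h // patch, patch, w // patch, patch, 2)
               .transpose(0, 2, 1, 3, 4)
               .reshape(-1, patch * patch * 2))
    # Squared distances via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
    d2 = ((patches ** 2).sum(1, keepdims=True)
          - 2.0 * patches @ codebook.T
          + (codebook ** 2).sum(1))
    return d2.argmin(axis=1)


if __name__ == "__main__":
    keyframes, flows = decompose_video("example.mp4")  # hypothetical file
    rng = np.random.default_rng(0)
    codebook = rng.standard_normal((1024, 16 * 16 * 2))  # untrained stand-in
    for flow in flows:
        tokens = tokenize_motion(flow, codebook)
        print(tokens[:10])  # short discrete motion-token sequence per clip
```

In the actual framework, the analogous keyframe and motion tokens are interleaved with text tokens for unified generative pre-training, and at inference the generated tokens are decoded back to continuous pixels; the random codebook above would be replaced by the learned tokenizers described in the paper.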