Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Abstract

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training because its spatiotemporal dynamics must be modeled. In this paper, we address these challenges in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information into a small number of tokens, thus enabling unified generative pre-training on videos, images, and text. At inference, the tokens generated by the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models will be available at https://video-lavit.github.io.
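
To make the keyframe-plus-motion idea concrete, the following toy sketch may help. It is not the tokenizer described in the paper: it assumes flattened grayscale frames, uses plain frame-to-frame differences as a stand-in for temporal motion, and discretizes them with a nearest-neighbor lookup in a hand-written codebook. The names `decompose`, `quantize`, and `reconstruct` are hypothetical and chosen only for this illustration.

```python
from typing import List, Tuple

Frame = List[float]  # one flattened grayscale frame (assumption for this sketch)


def decompose(video: List[Frame], clip_len: int) -> List[Tuple[Frame, List[Frame]]]:
    """Split a video into clips, each stored as (keyframe, motion deltas).

    The keyframe is the first frame of the clip; each motion entry is the
    element-wise difference between consecutive frames (a toy stand-in).
    """
    clips = []
    for start in range(0, len(video), clip_len):
        clip = video[start:start + clip_len]
        keyframe = clip[0]
        motion = [
            [cur_px - prev_px for prev_px, cur_px in zip(prev, cur)]
            for prev, cur in zip(clip, clip[1:])
        ]
        clips.append((keyframe, motion))
    return clips


def quantize(vector: Frame, codebook: List[Frame]) -> int:
    """Return the index of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(u: Frame, v: Frame) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def reconstruct(keyframe: Frame, motion: List[Frame]) -> List[Frame]:
    """Rebuild a clip by adding the motion deltas cumulatively to the keyframe."""
    frames = [list(keyframe)]
    for delta in motion:
        frames.append([px + d for px, d in zip(frames[-1], delta)])
    return frames


# Usage: a 6-frame, 2-pixel video split into clips of 3 frames each.
video = [[float(t), float(2 * t)] for t in range(6)]
clips = decompose(video, clip_len=3)
keyframe, motion = clips[0]
codebook = [[0.0, 0.0], [1.0, 2.0]]  # hypothetical motion codebook
motion_tokens = [quantize(delta, codebook) for delta in motion]
restored = reconstruct(keyframe, motion)
```

In this sketch each clip costs one keyframe plus one discrete token per subsequent frame, which illustrates why separating appearance from motion can keep the token count small; a real system would learn both tokenizers rather than use a fixed codebook.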
