Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Abstract

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training because its spatiotemporal dynamics must be modeled. In this paper, we address these challenges in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the tokens generated by the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is capable of both comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models will be available at https://video-lavit.github.io.
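
The idea of representing a video as a keyframe plus temporal motion, each discretized into tokens, can be illustrated with a toy sketch. This is not the Video-LaVIT implementation: the abstract does not say how motion is extracted or how the tokenizers are built. The sketch assumes, purely for illustration, that a frame is a flat list of numbers, that "motion" is the difference between consecutive frames, and that tokenization is a nearest-neighbor lookup in a small hand-written codebook. All names (`decompose`, `quantize`, `reconstruct`, `codebook`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # hypothetical toy representation: one flattened frame


def decompose(frames: Sequence[Frame]) -> Tuple[Frame, List[Frame]]:
    """Split a clip into its first frame (keyframe) and per-step frame differences.

    Assumption: frame differences stand in for motion; the paper's actual
    motion representation is not described in the abstract.
    """
    if not frames:
        raise ValueError("need at least one frame")
    keyframe = list(frames[0])
    motions = [
        [cur - prev for cur, prev in zip(frames[i], frames[i - 1])]
        for i in range(1, len(frames))
    ]
    return keyframe, motions


def quantize(vector: Frame, codebook: Sequence[Frame]) -> int:
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(entry: Frame) -> float:
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def reconstruct(keyframe: Frame, motions: Sequence[Frame]) -> List[Frame]:
    """Rebuild frames by accumulating motion vectors onto the keyframe."""
    frames = [list(keyframe)]
    for motion in motions:
        frames.append([p + d for p, d in zip(frames[-1], motion)])
    return frames


# Toy clip of three 2-value frames and a hand-written motion codebook.
clip = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0]]
codebook = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

keyframe, motions = decompose(clip)
motion_tokens = [quantize(m, codebook) for m in motions]  # discrete token ids
decoded = reconstruct(keyframe, [codebook[t] for t in motion_tokens])
```

In this toy case the codebook happens to contain every motion vector exactly, so decoding the tokens recovers the clip without loss; with a learned codebook the reconstruction would in general be approximate.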
