VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
January 22, 2025
Authors: Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao
cs.AI
Abstract
In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation
model for image and video understanding. The core design philosophy of
VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the
vision-centric training paradigm and vision-centric framework design. The key
insight of our vision-centric training paradigm is that high-quality image-text
data is crucial for both image and video understanding. Instead of preparing
massive video-text datasets, we focus on constructing large-scale and
high-quality image-text datasets. VideoLLaMA3 has four training stages: 1)
vision-centric alignment stage, which warms up the vision encoder and
projector; 2) vision-language pretraining stage, which jointly tunes the vision
encoder, projector, and LLM with large-scale image-text data covering multiple
types (including scene images, documents, charts) as well as text-only data; 3)
multi-task fine-tuning stage, which incorporates image-text SFT data for
downstream tasks and video-text data to establish a foundation for video
understanding; and 4) video-centric fine-tuning, which further improves the model's
capability in video understanding. As for the framework design, to better
capture fine-grained details in images, the pretrained vision encoder is
adapted to encode images of varying sizes into a correspondingly varying number of
vision tokens, rather than a fixed number of tokens. For video inputs, we reduce the
number of vision tokens according to their similarity, so that the
representation of videos is more precise and compact. Benefiting from these
vision-centric designs, VideoLLaMA3 achieves compelling performance on both
image and video understanding benchmarks.
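
To make the two framework-design points more concrete, the following is a minimal, hypothetical sketch (not the authors' released implementation) of variable-resolution tokenization and similarity-based video token reduction. The patch size, tensor shapes, and similarity threshold are illustrative assumptions.

```python
# Illustrative sketch of the two vision-centric framework ideas from the abstract:
# (1) the number of vision tokens scales with image size instead of being fixed;
# (2) video tokens that are nearly identical to the same spatial token in the
#     previous frame are pruned, yielding a more compact video representation.
import torch
import torch.nn.functional as F


def num_vision_tokens(height: int, width: int, patch: int = 14) -> int:
    """Token count varies with image resolution (assumed 14x14 patches)."""
    return (height // patch) * (width // patch)


def prune_similar_video_tokens(frames: torch.Tensor, threshold: float = 0.9) -> list[torch.Tensor]:
    """Drop per-frame tokens whose cosine similarity to the corresponding token
    in the previous frame exceeds `threshold`.

    frames: (T, N, D) tensor of per-frame vision tokens.
    Returns a list of T tensors, each with a variable number of kept tokens.
    """
    kept = [frames[0]]  # always keep the first frame in full
    for t in range(1, frames.shape[0]):
        sim = F.cosine_similarity(frames[t], frames[t - 1], dim=-1)  # (N,)
        kept.append(frames[t][sim < threshold])  # keep only changed tokens
    return kept


if __name__ == "__main__":
    tokens = num_vision_tokens(448, 336)      # 32 * 24 = 768 tokens for this size
    video = torch.randn(8, tokens, 1024)      # 8 frames of illustrative tokens
    compact = prune_similar_video_tokens(video)
    print(tokens, [x.shape[0] for x in compact])
```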