Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

June 5, 2023
Authors: Hang Zhang, Xin Li, Lidong Bing
cs.AI

Abstract

We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous vision-LLMs that focus on static image comprehension, such as MiniGPT-4 (Zhu et al., 2023) and LLaVA (Liu et al., 2023), Video-LLaMA tackles two challenges in video understanding: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. For the first challenge, we propose a Video Q-former to extend the pre-trained image encoder to a video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind (Girdhar et al., 2023), which performs exceptionally well in aligning different modalities to a common embedding space, as the pre-trained audio encoder, and then introduce an Audio Q-former to learn auditory query tokens. To align the outputs of the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on a large-scale vision-caption dataset and a high-quantity vision-instruction-tuning dataset. We find that Video-LLaMA showcases the ability to perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information present in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants. Our code, pre-trained model, and demo are available at https://github.com/DAMO-NLP-SG/Video-LLaMA.
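To make the design described in the abstract concrete, here is a minimal, self-contained PyTorch sketch of the visual branch: frozen per-frame image features receive temporal position embeddings, a small set of learnable query tokens cross-attends to them as a stand-in for the Video Q-former, and a linear layer projects the result into the LLM's embedding space. The class name VideoQFormerSketch, the use of nn.TransformerDecoder in place of a BLIP-2-style Q-Former, and all dimensions and layer counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VideoQFormerSketch(nn.Module):
    """Illustrative sketch of the visual branch described in the abstract:
    frozen per-frame features -> temporal position embeddings ->
    learnable queries cross-attending over all frames -> projection
    into the (frozen) LLM's embedding space. Sizes are assumptions."""

    def __init__(self, feat_dim=1408, num_queries=32, llm_dim=4096,
                 max_frames=32, num_layers=2, num_heads=8):
        super().__init__()
        self.frame_pos = nn.Embedding(max_frames, feat_dim)          # temporal position embeddings
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(feat_dim, llm_dim)                      # align with LLM embedding space

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, num_patches, feat_dim) from a frozen image encoder
        b, t, p, d = frame_feats.shape
        pos = self.frame_pos(torch.arange(t, device=frame_feats.device))
        frame_feats = frame_feats + pos[None, :, None, :]             # inject frame order
        memory = frame_feats.reshape(b, t * p, d)                     # flatten frames into one sequence
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        video_tokens = self.qformer(tgt=queries, memory=memory)       # queries attend over all frames
        return self.proj(video_tokens)                                # soft "video prompts" for the LLM


if __name__ == "__main__":
    # Pretend these features came from a frozen image encoder applied to 8 sampled frames.
    fake_frame_feats = torch.randn(1, 8, 257, 1408)
    video_prompts = VideoQFormerSketch()(fake_frame_feats)
    print(video_prompts.shape)  # torch.Size([1, 32, 4096])
```

Under the paper's description, the audio branch would follow the same pattern, with ImageBind features replacing the image-encoder features and an Audio Q-former producing the auditory query tokens.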