Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
June 5, 2023
Authors: Hang Zhang, Xin Li, Lidong Bing
cs.AI
Abstract
We present Video-LLaMA, a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in video. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous vision-LLMs that focus on static image comprehension, such as MiniGPT-4 [zhu2023minigpt] and LLaVA [liu2023visualit], Video-LLaMA tackles two challenges in video understanding: (1) capturing temporal changes in visual scenes and (2) integrating audio-visual signals. For the first challenge, we propose a Video Q-former that extends the pre-trained image encoder to a video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind [girdhar2023imagebind], a pre-trained audio encoder that performs exceptionally well in aligning different modalities to a common embedding space, and then introduce an Audio Q-former to learn auditory query tokens. To align the outputs of both the visual and audio encoders with the LLM's embedding space, we train Video-LLaMA on a large-scale vision-caption dataset and a high-quantity vision-instruction-tuning dataset. We find that Video-LLaMA showcases the ability to perceive and comprehend video content, generating meaningful responses grounded in the visual and auditory information present in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants. Our code, pre-trained model, and demo are available at https://github.com/DAMO-NLP-SG/Video-LLaMA.
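
To make the visual branch concrete, below is a minimal PyTorch sketch of the idea described in the abstract: per-frame features from a frozen image encoder receive temporal position embeddings, a Video Q-former (learnable query tokens attending to the frame sequence) aggregates them, and a linear layer projects the resulting query outputs into the LLM's embedding space. All module names, dimensions, and the use of nn.TransformerDecoder as a stand-in for the paper's BLIP-2-style Q-Former are illustrative assumptions, not the released implementation.

```python
# Sketch of Video-LLaMA's visual branch (hypothetical names and dimensions):
# frozen image encoder -> per-frame features -> temporal position embeddings
# -> Video Q-former (learnable queries cross-attending to frames)
# -> linear projection into the frozen LLM's embedding space.
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    def __init__(self, frame_dim=768, num_queries=32, llm_dim=4096, max_frames=32):
        super().__init__()
        # learnable video query tokens (analogous to the paper's video queries)
        self.video_queries = nn.Parameter(torch.randn(num_queries, frame_dim))
        # temporal position embedding injects frame-order information
        self.temporal_pos = nn.Embedding(max_frames, frame_dim)
        # stand-in for the Video Q-former: a shallow transformer decoder whose
        # cross-attention reads the frame features (assumption; the real model
        # uses a BLIP-2-style Q-Former)
        layer = nn.TransformerDecoderLayer(d_model=frame_dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)
        # linear layer mapping query outputs into the frozen LLM's embedding space
        self.proj = nn.Linear(frame_dim, llm_dim)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, frame_dim) from a frozen image encoder
        b, t, _ = frame_feats.shape
        pos = self.temporal_pos(torch.arange(t, device=frame_feats.device))
        frames = frame_feats + pos  # add temporal position embeddings
        queries = self.video_queries.unsqueeze(0).expand(b, -1, -1)
        video_tokens = self.qformer(tgt=queries, memory=frames)
        # soft video prompts prepended to the LLM's text embeddings
        return self.proj(video_tokens)  # (batch, num_queries, llm_dim)

# Usage: features for 8 frames of one video
feats = torch.randn(1, 8, 768)
video_prompt = VideoQFormerSketch()(feats)
print(video_prompt.shape)  # torch.Size([1, 32, 4096])
```

The audio branch described in the abstract is analogous: ImageBind produces segment-level audio embeddings, and an Audio Q-former plus a projection layer maps them into the same LLM embedding space.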