Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Abstract

We present Video-LLaMA, a multi-modal framework that gives Large Language Models (LLMs) the ability to understand both the visual and the auditory content of videos. Video-LLaMA bootstraps cross-modal training from frozen pre-trained visual and audio encoders and frozen LLMs. Unlike previous vision-LLMs that focus on static image comprehension, such as MiniGPT-4 [zhu2023minigpt] and LLaVA [liu2023visualit], Video-LLaMA tackles two challenges in video understanding: (1) capturing the temporal changes in visual scenes, and (2) integrating audio-visual signals. For the first challenge, we propose a Video Q-former that extends the pre-trained image encoder to a video encoder, and we introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we use ImageBind [girdhar2023imagebind], which performs exceptionally well at aligning different modalities to a common embedding space, as the pre-trained audio encoder, and we introduce an Audio Q-former to learn auditory query tokens. To align the outputs of both the visual and the audio encoders with the LLM's embedding space, we train Video-LLaMA on a large-scale vision caption dataset and a high-quality vision-instruction-tuning dataset. We find that Video-LLaMA can perceive and comprehend video content, generating meaningful responses that are grounded in the visual and auditory information present in the videos. This highlights the potential of Video-LLaMA as a promising prototype for audio-visual AI assistants. Our code, pre-trained model, and demo are available at https://github.com/DAMO-NLP-SG/Video-LLaMA.
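
The query-based design described above, in which a fixed set of learnable queries attends over per-frame features and the result is projected into the LLM's embedding space, can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses one feature vector per frame, a single attention head with no learned key/value projections and no self-attention among queries, and random weights in place of trained ones. The names `query_pool`, `rand_matrix`, and all dimensions are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Random stand-in for trained weights or encoder features (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def query_pool(frame_feats, queries, pos_emb, w_proj):
    """Pool per-frame features into a fixed number of LLM-sized tokens.

    frame_feats: T x D features from a frozen encoder (one row per frame here)
    queries:     Q x D learnable query vectors
    pos_emb:     T x D temporal position embeddings
    w_proj:      D x D_llm linear projection into the LLM embedding space
    """
    dim = len(queries[0])
    # Add temporal position information so frame order is visible to attention.
    keys = [[f + p for f, p in zip(frame, pos)]
            for frame, pos in zip(frame_feats, pos_emb)]
    # Scaled dot-product scores between each query and each frame (Q x T).
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    # Each query becomes a weighted mix of frame features (Q x D).
    pooled = matmul(weights, keys)
    # Project into the (hypothetical) LLM embedding width (Q x D_llm).
    return matmul(pooled, w_proj), weights


rng = random.Random(0)
T, D, Q, D_LLM = 8, 16, 4, 32  # frames, feature dim, queries, LLM dim (all assumed)
frames = rand_matrix(T, D, rng)
tokens, weights = query_pool(
    frames,
    rand_matrix(Q, D, rng),
    rand_matrix(T, D, rng),
    rand_matrix(D, D_LLM, rng),
)
print(len(tokens), len(tokens[0]))
```

Whatever the number of frames, the output is always `Q` tokens of the LLM's embedding width, which is what lets a variable-length video be passed to a frozen LLM as a fixed-size soft prompt. The same pooling shape would apply to audio segments in place of frames.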
