PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Abstract

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not use audio signals for better video understanding (e.g., Video-ChatGPT). To address these gaps, we propose Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose using Vicuna instead of GPT-3.5, which Video-ChatGPT uses, for video-based conversation benchmarking; this ensures reproducible results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
