PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
November 22, 2023
Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
cs.AI
Abstract
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches extending image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, a concern raised by the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project page: https://github.com/mbzuai-oryx/Video-LLaVA
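
To make the audio pathway concrete: the abstract describes transcribing audio cues into text to enrich the video context. Below is a minimal sketch of that idea, not the authors' pipeline; it assumes the open-source Whisper ASR model, and the prompt template is hypothetical.

```python
# Minimal sketch of audio-to-text context enrichment (not the authors' code).
# Assumes openai-whisper is installed (pip install openai-whisper) and ffmpeg
# is on PATH; the fusion template below is hypothetical.
import whisper

def build_enriched_prompt(video_path: str, user_question: str) -> str:
    """Transcribe the video's audio track and prepend it to the user query."""
    asr = whisper.load_model("base")  # small open-source ASR model
    transcript = asr.transcribe(video_path)["text"].strip()
    # Fuse the transcript with the question so the LMM receives the audio
    # modality as text; PG-Video-LLaVA's exact template may differ.
    return (
        f"Audio transcript: {transcript}\n"
        f"Question: {user_question}"
    )

if __name__ == "__main__":
    print(build_enriched_prompt("clip.mp4", "What is the speaker describing?"))
```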
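The grounding module plus off-the-shelf tracker can likewise be illustrated with a sketch. Here an open-vocabulary detector (OWL-ViT via the Hugging Face pipeline) stands in for the paper's grounding components, and a naive IoU-based association stands in for the tracker; both substitutions are assumptions, not the paper's actual modules.

```python
# Sketch of prompt-driven spatio-temporal grounding (not the authors' module):
# detect the referred phrase per frame, then link detections across time.
# Assumes transformers and Pillow are installed.
from PIL import Image
from transformers import pipeline

detector = pipeline("zero-shot-object-detection",
                    model="google/owlvit-base-patch32")

def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ground_phrase(frames: list[Image.Image], phrase: str, min_iou: float = 0.3):
    """Detect `phrase` in each frame and link detections across time by IoU."""
    track, prev = [], None
    for t, frame in enumerate(frames):
        dets = sorted(detector(frame, candidate_labels=[phrase]),
                      key=lambda d: d["score"], reverse=True)
        if not dets:
            continue
        boxes = [(d["box"]["xmin"], d["box"]["ymin"],
                  d["box"]["xmax"], d["box"]["ymax"]) for d in dets]
        # Prefer the detection that best continues the existing track,
        # otherwise start from the highest-scoring one.
        best = max(boxes, key=lambda b: iou(prev, b)) if prev else boxes[0]
        if prev is None or iou(prev, best) >= min_iou:
            track.append((t, best))
            prev = best
    return track  # [(frame_index, box), ...]: the spatio-temporal localization
```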
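Finally, the proposed swap of Vicuna for GPT-3.5 in conversation benchmarking amounts to using an open LLM as the answer judge. The sketch below is hypothetical: the judge prompt and scoring scale are illustrative and not the paper's exact protocol, and Vicuna normally expects its own chat template.

```python
# Hypothetical open-source judge for video-conversation benchmarking,
# replacing GPT-3.5 with Vicuna so results are reproducible.
from transformers import pipeline

judge = pipeline("text-generation", model="lmsys/vicuna-7b-v1.5")

def score_answer(question: str, reference: str, prediction: str) -> str:
    """Ask the judge model to rate a predicted answer against the reference."""
    prompt = (
        "Rate the factual accuracy of the predicted answer against the "
        "reference on a scale of 0 to 5. Reply with the number only.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Prediction: {prediction}\n"
        "Score:"
    )
    out = judge(prompt, max_new_tokens=4, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()  # the completion after the prompt
```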