PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
November 22, 2023
Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
cs.AI
Abstract
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose using Vicuna in place of the GPT-3.5 used by Video-ChatGPT for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page:
https://github.com/mbzuai-oryx/Video-LLaVA
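
The two mechanisms the abstract describes — transcribing audio into text to enrich the prompt, and grounding phrases from the model's answer with a detector plus an off-the-shelf tracker — can be illustrated with a minimal Python sketch. Whisper is used here only as a stand-in ASR, and the `detector`/`tracker` callables are hypothetical placeholders; the paper's own grounding module is novel and is not reproduced here.

```python
# Minimal sketch of the pipeline described above, under stated assumptions:
# Whisper stands in for the ASR component, and the detector/tracker callables
# are hypothetical placeholders for an open-vocabulary detector and an
# off-the-shelf tracker (not the paper's actual modules).
from typing import Callable, List, Sequence

import whisper  # pip install openai-whisper


def transcribe_audio(video_path: str) -> str:
    """Transcribe the audio track so spoken cues can enrich video context."""
    asr = whisper.load_model("base")
    # whisper extracts audio from the video container via ffmpeg
    return asr.transcribe(video_path)["text"]


def build_prompt(user_query: str, transcript: str) -> str:
    """Fold the audio transcript into the prompt given to the video LMM."""
    return f"Audio transcript: {transcript}\n\nQuestion: {user_query}"


def ground_phrase(
    frames: Sequence,
    phrase: str,
    detector: Callable[[object, str], list],  # (frame, phrase) -> boxes
    tracker: Callable[[List[list]], list],    # per-frame boxes -> tracks
) -> list:
    """Localize `phrase` in each frame, then link detections across time."""
    per_frame_boxes = [detector(frame, phrase) for frame in frames]
    return tracker(per_frame_boxes)
```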
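
The reproducibility argument — replacing proprietary GPT-3.5 with open-weights Vicuna for conversation benchmarking — amounts to LLM-as-judge scoring with a locally hosted model. A hedged sketch follows, assuming the public lmsys/vicuna-7b-v1.5 checkpoint and an illustrative scoring prompt; neither is the paper's exact evaluation protocol.

```python
# Hedged sketch: score a candidate answer against a reference with a local
# Vicuna model via Hugging Face transformers, instead of calling GPT-3.5.
# The checkpoint name and prompt format are illustrative assumptions.
from transformers import pipeline

judge = pipeline("text-generation", model="lmsys/vicuna-7b-v1.5")


def judge_answer(question: str, reference: str, candidate: str) -> str:
    prompt = (
        "Rate from 0 to 5 how well the candidate answer matches the reference.\n"
        f"Question: {question}\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "Score:"
    )
    # return_full_text=False keeps only the newly generated score text
    out = judge(prompt, max_new_tokens=8, return_full_text=False)
    return out[0]["generated_text"]
```

Because the judge's weights are fixed and run locally, scores can be re-run deterministically (with greedy decoding), whereas a closed API's behavior may change silently between evaluations.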