PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

November 22, 2023
Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
cs.AI

Abstract

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches extending image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
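
The abstract describes a three-stage pipeline: transcribe the audio track into text, feed the transcript alongside sampled frames to the video LMM, and localize the objects the response refers to with a grounding module backed by an off-the-shelf tracker. The sketch below illustrates that flow under stated assumptions: it uses OpenAI's open-source whisper package for transcription (the abstract only says audio is transcribed into text, so the specific model is illustrative), and answer_video_query / ground_and_track are hypothetical placeholders for the paper's conversation and grounding modules, not the released API.

```python
# Sketch of the pipeline the abstract describes:
#   audio -> text, (frames + text) -> LMM answer, answer -> spatio-temporal grounding.
# Functions marked "hypothetical" are illustrative placeholders,
# not the released PG-Video-LLaVA interface.
import whisper  # pip install openai-whisper


def transcribe_audio(audio_path: str) -> str:
    """Transcribe the video's audio track so auditory cues become text context."""
    model = whisper.load_model("base")  # small, CPU-friendly checkpoint
    return model.transcribe(audio_path)["text"]


def answer_video_query(frames, transcript: str, instruction: str) -> str:
    """Hypothetical stand-in for the video LMM:
    frames + audio transcript + user instruction -> free-form answer."""
    raise NotImplementedError("replace with the actual PG-Video-LLaVA forward pass")


def ground_and_track(frames, phrase: str):
    """Hypothetical stand-in for the grounding module plus off-the-shelf tracker:
    returns per-frame bounding boxes for the referred object."""
    raise NotImplementedError("replace with the grounding/tracking stack")


def grounded_video_chat(frames, audio_path: str, instruction: str):
    transcript = transcribe_audio(audio_path)                      # 1. audio cues as text
    answer = answer_video_query(frames, transcript, instruction)   # 2. video conversation
    # 3. localize the referred object in space and time; a real system would first
    #    extract noun phrases from the answer, here we ground the answer as a whole.
    tracks = ground_and_track(frames, answer)
    return answer, tracks
```

Transcribing audio into text, rather than fusing raw audio embeddings, keeps the model's inputs to frames plus text, which is how the abstract frames the audio integration.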