PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
November 22, 2023
Authors: Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
cs.AI
Abstract
Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches extending image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially and temporally localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
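
The audio pathway the abstract describes (transcribing audio cues into text that enriches the prompt) is easy to prototype. A minimal sketch, assuming the openai-whisper package; the abstract does not name a specific ASR model, so Whisper and the file path "video.mp4" are illustrative choices, not the paper's exact setup:

```python
# Sketch: turn a video's audio track into text and fold it into the
# LMM prompt as extra video context.
# Assumes: pip install openai-whisper (ffmpeg must be on PATH);
# "video.mp4" is a placeholder path.
import whisper

model = whisper.load_model("base")       # small off-the-shelf ASR checkpoint
result = model.transcribe("video.mp4")   # Whisper extracts the audio via ffmpeg
audio_context = result["text"].strip()

prompt = (
    f"Audio transcript: {audio_context}\n"
    "Question: What is the person in the video talking about?"
)
print(prompt)
```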
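The grounding loop implied by the abstract (localize each referred object per frame, then associate detections over time) can be outlined as below. This is a sketch of the control flow only: detect_phrase is a hypothetical stand-in for an off-the-shelf open-vocabulary detector, and the naive per-frame association stands in for the tracker the abstract mentions.

```python
# Sketch: ground noun phrases from a model answer across video frames,
# collecting per-frame boxes into simple tracks. detect_phrase is a
# hypothetical placeholder, not the paper's actual grounding module.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Track:
    phrase: str
    boxes: list = field(default_factory=list)  # (frame_idx, x1, y1, x2, y2)

def detect_phrase(frame, phrase):
    """Hypothetical open-vocabulary detector. Here it returns a dummy
    full-frame box so the sketch runs end to end; a real system would
    return the detected region for `phrase`, or None."""
    h, w = frame.shape[:2]
    return (0, 0, w, h)

def ground_answer(frames, noun_phrases):
    tracks = {p: Track(p) for p in noun_phrases}
    for i, frame in enumerate(frames):
        for phrase, track in tracks.items():
            box = detect_phrase(frame, phrase)   # spatial localization
            if box is not None:
                track.boxes.append((i, *box))    # temporal association
    return tracks

frames = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(4)]
print(ground_answer(frames, ["person", "skateboard"])["person"].boxes)
```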
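On the reproducibility point, a Video-ChatGPT-style judge can be run with an open Vicuna checkpoint in place of the GPT-3.5 API. A minimal sketch via Hugging Face transformers; the model id lmsys/vicuna-7b-v1.5 and the scoring-prompt wording are illustrative assumptions, not the paper's exact evaluation protocol:

```python
# Sketch: score a predicted answer against a reference with a local
# Vicuna judge instead of the proprietary GPT-3.5 API.
# Assumes: pip install transformers torch accelerate; model id is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-7b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def judge(question, reference, prediction):
    # Vicuna v1.5 uses a simple USER/ASSISTANT chat format.
    prompt = (
        "USER: Rate how well the predicted answer matches the reference "
        "answer on a scale of 1-5. Reply with only the number.\n"
        f"Question: {question}\nReference: {reference}\nPrediction: {prediction}\n"
        "ASSISTANT:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    new_tokens = out[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(judge("What sport is shown?", "skateboarding", "a person skateboarding"))
```

Using a fixed open-weights judge keeps scores comparable across time, whereas a proprietary API can change silently under the same model name.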