Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models
October 4, 2024
Authors: Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang
cs.AI
Abstract
Video Large Language Models (Video-LLMs) have demonstrated remarkable
capabilities in coarse-grained video understanding; however, they struggle with
fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM,
a novel Video-LLM adept at perceiving and reasoning over specific video moments
in a fine-grained manner. We identify that current Video-LLMs have limitations
for fine-grained video understanding since they lack effective temporal
modeling and timestamp representation. In light of this, we sharpen our model
by incorporating (1) an additional temporal stream to encode the relationships
between frames and (2) discrete temporal tokens enriched with specific time
knowledge to represent timestamps. To optimize the training of
Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with
simple video-captioning tasks and progressively introducing video temporal
grounding tasks of increasing complexity. To further enhance
Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded
VideoQA dataset using an automatic annotation pipeline. Extensive experiments
demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding
tasks such as temporal sentence grounding, dense video captioning, and grounded
VideoQA, but also shows great potential as a versatile video assistant for
general video understanding.
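As a rough illustration of the "discrete temporal tokens" idea mentioned in the abstract, the sketch below quantizes a timestamp into one of a fixed number of bins relative to the video's duration and renders the bin as a token string. The abstract does not specify the exact scheme, so the bin count (300), the `<k>` token format, and the function names `timestamp_to_token` and `token_to_timestamp` are all assumptions made for illustration, not the authors' implementation.

```python
def timestamp_to_token(t_sec: float, duration_sec: float, num_bins: int = 300) -> str:
    """Map a timestamp to a discrete token such as "<150>".

    Assumption: tokens encode position relative to the video length,
    using num_bins + 1 possible values (<0> through <num_bins>).
    """
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    # Clamp the relative position to [0, 1] so out-of-range inputs stay valid.
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)
    # Python's round() rounds exact halves to the nearest even integer.
    idx = round(frac * num_bins)
    return f"<{idx}>"


def token_to_timestamp(token: str, duration_sec: float, num_bins: int = 300) -> float:
    """Invert the mapping: recover an approximate timestamp in seconds."""
    idx = int(token.strip("<>"))
    return idx / num_bins * duration_sec


if __name__ == "__main__":
    start = timestamp_to_token(12.4, 90.0)
    end = timestamp_to_token(31.0, 90.0)
    print(f"The event happens from {start} to {end}.")
```

One property of this kind of relative quantization is that the token vocabulary stays the same size regardless of video length, at the cost of coarser resolution on longer videos (each bin spans duration / num_bins seconds).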