Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

October 4, 2024
Authors: Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang
cs.AI

Abstract

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs are limited in fine-grained video understanding because they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded VideoQA dataset using an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.
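
The abstract does not spell out how timestamps become discrete temporal tokens, so the following is only a minimal sketch of one plausible reading: each timestamp is expressed as a fraction of the video duration and quantized into a fixed number of bins, each bin owning one token. The token format `<T{k}>`, the default of 100 bins, and both function names are hypothetical choices made for illustration and are not taken from the paper.

```python
def timestamp_to_token(t_seconds: float, duration: float, num_bins: int = 100) -> str:
    """Quantize an absolute timestamp into a relative discrete token.

    Assumption (not from the abstract): uniform bins over [0, duration],
    rendered with the hypothetical token format "<T{k}>", k in 0..num_bins.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    # Clamp to the video bounds so out-of-range inputs still map to a valid token.
    frac = min(max(t_seconds / duration, 0.0), 1.0)
    # Round half up to the nearest bin index.
    idx = int(frac * num_bins + 0.5)
    return f"<T{idx}>"


def token_to_timestamp(token: str, duration: float, num_bins: int = 100) -> float:
    """Map a hypothetical "<T{k}>" token back to seconds."""
    idx = int(token[2:-1])  # strip the leading "<T" and trailing ">"
    return idx / num_bins * duration


if __name__ == "__main__":
    start = timestamp_to_token(30.0, duration=120.0)
    end = timestamp_to_token(45.0, duration=120.0)
    print(f"The event happens from {start} to {end}.")
    print(token_to_timestamp(start, duration=120.0))
```

Under this scheme a video of any length uses the same small vocabulary of time tokens, at the cost of a resolution of `duration / num_bins` seconds per token.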

November 16, 2024