LLM4VG: Large Language Models Evaluation for Video Grounding
December 21, 2023
Authors: Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Zihan Song, Yuwei Zhou, Wenwu Zhu
cs.AI
Abstract
Recently, researchers have attempted to investigate the capability of LLMs in
handling videos and proposed several video LLM models. However, the ability of
LLMs to handle video grounding (VG), which is an important time-related video
task requiring the model to precisely locate the start and end timestamps of
temporal moments in videos that match the given textual queries, remains
unclear and unexplored in the literature. To fill this gap, in this paper, we
propose the LLM4VG benchmark, which systematically evaluates the performance of
different LLMs on video grounding tasks. Based on our proposed LLM4VG, we
design extensive experiments to examine two groups of video LLM models on video
grounding: (i) the video LLMs trained on the text-video pairs (denoted as
VidLLM), and (ii) the LLMs combined with pretrained visual description models
such as the video/image captioning model. We propose prompt methods to
integrate the instruction of VG and description from different kinds of
generators, including caption-based generators for direct visual description
and VQA-based generators for information enhancement. We also provide
comprehensive comparisons of various VidLLMs and explore the influence of
different choices of visual models, LLMs, prompt designs, etc. Our
experimental evaluations lead to two conclusions: (i) the existing VidLLMs are
still far from achieving satisfactory video grounding performance, and
more time-related video tasks should be included to further fine-tune these
models, and (ii) the combination of LLMs and visual models shows preliminary
abilities for video grounding with considerable potential for improvement by
resorting to more reliable models and further guidance through prompt instructions.
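To make the second setting concrete, the sketch below shows one way a text-only LLM could be prompted with timestamped frame captions (and optional VQA-derived facts) to return start/end timestamps. It is a minimal illustration written for this summary, not the authors' released code; the function names, prompt wording, and the echo_llm stand-in are assumptions.

from typing import Callable, List, Optional, Tuple

def build_vg_prompt(frame_captions: List[Tuple[float, str]],
                    query: str,
                    vqa_facts: Optional[List[str]] = None) -> str:
    """Compose a video grounding prompt from timestamped frame captions,
    optional VQA-derived facts, and the textual query."""
    lines = ["You are given descriptions of frames sampled from a video."]
    for t, caption in frame_captions:
        lines.append(f"[{t:.1f}s] {caption}")
    if vqa_facts:
        lines.append("Additional facts obtained by visual question answering:")
        lines.extend(f"- {fact}" for fact in vqa_facts)
    lines.append(f'Query: "{query}"')
    lines.append("Return the start and end timestamps (in seconds) of the moment "
                 "matching the query, formatted as: start=<seconds>, end=<seconds>.")
    return "\n".join(lines)

def ground_query(llm: Callable[[str], str],
                 frame_captions: List[Tuple[float, str]],
                 query: str,
                 vqa_facts: Optional[List[str]] = None) -> str:
    """Send the composed prompt to any text-only LLM callable and return its raw answer."""
    return llm(build_vg_prompt(frame_captions, query, vqa_facts))

if __name__ == "__main__":
    captions = [(0.0, "a man opens a fridge"),
                (4.0, "the man pours milk into a glass"),
                (9.0, "the man drinks from the glass")]
    echo_llm = lambda _prompt: "start=4.0, end=9.0"  # stand-in for a real LLM call
    print(ground_query(echo_llm, captions, "the person pours milk into a glass"))

In this framing, the caption-based and VQA-based generators are simply two text sources feeding the same prompt, mirroring the abstract's distinction between direct visual description and information enhancement.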