LLM4VG: Large Language Models Evaluation for Video Grounding
December 21, 2023
Authors: Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Zihan Song, Yuwei Zhou, Wenwu Zhu
cs.AI
Abstract
Recently, researchers have attempted to investigate the capability of LLMs in
handling videos and proposed several video LLM models. However, the ability of
LLMs to handle video grounding (VG), which is an important time-related video
task requiring the model to precisely locate the start and end timestamps of
temporal moments in videos that match the given textual queries, remains
unclear and unexplored in the literature. To fill this gap, in this paper, we
propose the LLM4VG benchmark, which systematically evaluates the performance of
different LLMs on video grounding tasks. Based on our proposed LLM4VG, we
design extensive experiments to examine two groups of video LLM models on video
grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLMs),
and (ii) LLMs combined with pretrained visual description models such as
video/image captioning models. We propose prompt methods to integrate the VG
instruction with descriptions from different kinds of generators, including
caption-based generators for direct visual description
and VQA-based generators for information enhancement. We also provide
comprehensive comparisons of various VidLLMs and explore the influence of
different choices of visual models, LLMs, prompt designs, etc. Our
experimental evaluations lead to two conclusions: (i) existing VidLLMs are
still far from achieving satisfactory video grounding performance, and more
time-related video tasks should be included to further fine-tune these
models, and (ii) the combination of LLMs and visual models shows preliminary
abilities for video grounding, with considerable potential for improvement by
resorting to more reliable visual models and more instructive prompt guidance.
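To make the LLM-plus-captioner setup concrete, below is a minimal Python sketch of how such a pipeline could be wired: per-frame captions are folded into a grounding prompt, the LLM's answer is parsed into start/end timestamps, and the prediction is scored with temporal IoU. The prompt wording, caption format, and helper names are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of an LLM + caption-based generator pipeline for video
# grounding. Prompt template, caption format, and the llm.generate call are
# assumptions for illustration, not the paper's actual implementation.
from typing import List, Tuple


def build_vg_prompt(frame_captions: List[Tuple[float, str]], query: str) -> str:
    """Assemble a video-grounding prompt from timestamped captions and a query."""
    description = "\n".join(f"[{t:.1f}s] {caption}" for t, caption in frame_captions)
    return (
        "You are given timestamped descriptions of a video.\n"
        f"{description}\n\n"
        f'Query: "{query}"\n'
        "Answer with the start and end time (in seconds) of the moment that "
        "matches the query, formatted as: start-end."
    )


def parse_timestamps(answer: str) -> Tuple[float, float]:
    """Parse an LLM answer such as '5.0-10.0' into (start, end) floats."""
    start, end = answer.strip().split("-")
    return float(start), float(end)


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union between predicted and ground-truth segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


# Example usage with hypothetical captions from an image captioning model.
captions = [
    (0.0, "a man opens a fridge"),
    (5.0, "the man pours milk into a glass"),
    (10.0, "the man drinks the milk"),
]
prompt = build_vg_prompt(captions, "the person pours milk")
# response = llm.generate(prompt)                  # hypothetical LLM call
# score = temporal_iou(parse_timestamps(response), gt=(5.0, 10.0))
```

A VQA-based generator would slot into the same skeleton by appending question-answer pairs about the query to the description block before the instruction.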