LLM4VG: Large Language Models Evaluation for Video Grounding

Abstract

Researchers have recently begun to investigate how well large language models (LLMs) handle video and have proposed several video LLMs. However, their ability to perform video grounding (VG), an important time-related video task that requires a model to precisely locate the start and end timestamps of the moment in a video that matches a given textual query, remains unclear and unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding. Based on LLM4VG, we design extensive experiments to examine two groups of models on video grounding: (i) video LLMs trained on text-video pairs (denoted VidLLM), and (ii) LLMs combined with pretrained visual description models, such as video or image captioning models. We propose prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, prompt designs, and other factors. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included when further fine-tuning these models; and (ii) the combination of LLMs and visual models shows preliminary video grounding ability, with considerable room for improvement through more reliable models and further guidance from prompt instructions.
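To make the second setting concrete, the following is a minimal sketch of how timestamped captions from a visual description model might be combined with a grounding instruction, and how a start/end span might be read back from an LLM reply. It is an illustration only, not the prompt or parser used in LLM4VG: the function names `build_vg_prompt` and `parse_span`, the instruction wording, and the `start - end` answer format are all assumptions introduced here.

```python
import re
from typing import List, Optional, Tuple


def build_vg_prompt(captions: List[Tuple[float, str]], query: str) -> str:
    """Join timestamped captions with a grounding instruction.

    The wording and layout are hypothetical, chosen only for illustration.
    """
    lines = [f"{t:.1f}s: {text}" for t, text in captions]
    instruction = (
        "Find the moment in the video that matches the query. "
        "Answer with the start and end time in seconds, formatted as 'start - end'."
    )
    return "\n".join(["Video description:", *lines, f"Query: {query}", instruction])


def parse_span(reply: str) -> Optional[Tuple[float, float]]:
    """Extract the first 'start - end' (or 'start to end') pair from a reply."""
    pattern = r"(\d+(?:\.\d+)?)\s*(?:s|seconds)?\s*(?:-|to)\s*(\d+(?:\.\d+)?)"
    match = re.search(pattern, reply)
    if match is None:
        return None  # the reply contained no recognizable span
    start, end = float(match.group(1)), float(match.group(2))
    # Reject spans whose end precedes their start.
    return (start, end) if start <= end else None
```

In this sketch, the prompt would be sent to whichever LLM is under evaluation, and `parse_span` would turn its free-text reply into a predicted moment; replies without a usable span return `None` rather than a guessed interval.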
