LLM4VG: Large Language Models Evaluation for Video Grounding

Abstract

Researchers have recently begun investigating how well large language models (LLMs) handle video, and have proposed several video LLMs. However, the ability of LLMs to perform video grounding (VG) remains unclear and unexplored in the literature. VG is an important time-related video task that requires a model to precisely locate the start and end timestamps of the moments in a video that match a given textual query. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding. Based on LLM4VG, we design extensive experiments to examine two groups of models on video grounding: (i) video LLMs trained on text-video pairs (denoted VidLLM), and (ii) LLMs combined with pretrained visual description models, such as video or image captioning models. We propose prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, and prompt designs, among others. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included when fine-tuning these models; and (ii) the combination of LLMs and visual models shows preliminary video grounding ability, with considerable potential for improvement through more reliable models and further guidance from prompt instructions.
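To make the second setting more concrete, the sketch below shows one way timestamped captions and a grounding instruction might be combined into a single prompt, and how a predicted span might be read back from the LLM's reply. It is only an illustration under stated assumptions: the function names `build_vg_prompt` and `parse_span`, the prompt wording, and the answer format are all hypothetical and are not taken from the paper.

```python
import re
from typing import List, Optional, Tuple


def build_vg_prompt(captions: List[Tuple[float, str]], query: str) -> str:
    """Combine timestamped captions with a video grounding instruction.

    `captions` holds (timestamp_in_seconds, description) pairs, e.g. produced
    by an image captioning model run on sampled frames (an assumption here).
    """
    lines = [f"{t:.1f}s: {text}" for t, text in captions]
    return (
        "Below are descriptions of a video at different timestamps.\n"
        + "\n".join(lines)
        + f"\n\nQuery: {query}\n"
        + "Give the start and end time of the moment that matches the query, "
        + "in the form 'start: <seconds>, end: <seconds>'."
    )


def parse_span(answer: str) -> Optional[Tuple[float, float]]:
    """Extract (start, end) in seconds from a reply; None if absent or invalid."""
    match = re.search(
        r"start:\s*(\d+(?:\.\d+)?).*?end:\s*(\d+(?:\.\d+)?)",
        answer,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return None
    start, end = float(match.group(1)), float(match.group(2))
    # Reject spans that end before they begin.
    return (start, end) if start <= end else None


if __name__ == "__main__":
    prompt = build_vg_prompt(
        [(0.0, "a man enters the kitchen"), (2.0, "he opens the fridge")],
        "person opens the fridge",
    )
    print(prompt)
    print(parse_span("start: 2.0, end: 5.5"))
```

The prompt string would be sent to whichever LLM is being evaluated; that call is left out here because it depends on the model and API in use.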
