Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Abstract

Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal tube in a video that is specified by an input text query. In this paper, we use multimodal large language models (MLLMs) to explore a zero-shot solution for STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as grounding tokens, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding because they cannot fully integrate the cues in the text query (e.g., attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries, which inquire about the existence of the target spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatially and temporally related visual regions. In addition, because the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy, which assembles the predictions obtained with the original video frames and with temporal-augmented frames as inputs to improve temporal consistency. We evaluate our method on various MLLMs and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
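The idea behind TAS, assembling predictions from the original frames and from temporally augmented frames, can be illustrated with a minimal sketch. The abstract does not specify the augmentation or the assembling rule, so the sketch assumes, purely for illustration, that the augmentation reverses the frame order and that assembling averages per-frame scores. The names `assemble_with_temporal_augmentation` and `toy_score_fn` are hypothetical and are not taken from the released code.

```python
from typing import Callable, List, Sequence

Frame = str  # placeholder type; a real pipeline would pass image tensors
ScoreFn = Callable[[Sequence[Frame]], List[float]]


def assemble_with_temporal_augmentation(
    frames: Sequence[Frame], score_fn: ScoreFn
) -> List[float]:
    """Average per-frame scores from the original and the reversed frame order.

    Assumption: frame reversal stands in for the temporal augmentation and
    averaging stands in for the assembling step; neither is given in the abstract.
    """
    original = score_fn(frames)
    # Hypothetical augmentation: feed the same frames in reverse order.
    augmented = score_fn(list(reversed(frames)))
    # Undo the reversal so index i refers to the same frame in both passes.
    realigned = list(reversed(augmented))
    return [(a + b) / 2.0 for a, b in zip(original, realigned)]


def toy_score_fn(frames: Sequence[Frame]) -> List[float]:
    # Stand-in for an MLLM: the score depends on frame content and on position,
    # so the two passes disagree and averaging has a visible effect.
    return [len(frame) + 0.1 * index for index, frame in enumerate(frames)]


frames = ["a", "bb", "ccc"]
scores = assemble_with_temporal_augmentation(frames, toy_score_fn)
```

In this toy setup the position-dependent part of the score differs between the two passes, and averaging evens it out across frames, which is the kind of temporal consistency the assembling step is meant to encourage.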
