Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
September 18, 2025
Authors: Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau
cs.AI
Abstract
Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal
tube of a video, as specified by the input text query. In this paper, we
utilize multimodal large language models (MLLMs) to explore a zero-shot
solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to
dynamically assign special tokens, referred to as grounding tokens,
for grounding the text query; and (2) MLLMs often suffer from suboptimal
grounding due to the inability to fully integrate the cues in the text query
(e.g., attributes, actions) for inference. Based on these insights, we
propose an MLLM-based zero-shot framework for STVG, which includes novel
decomposed spatio-temporal highlighting (DSTH) and temporal-augmented
assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH
strategy first decouples the original query into attribute and action
sub-queries for inquiring about the existence of the target both spatially and
temporally. It then uses a novel logit-guided re-attention (LRA) module to
learn latent variables as spatial and temporal prompts, by regularizing token
predictions for each sub-query. These prompts highlight attribute and action
cues, respectively, directing the model's attention to reliable spatially and
temporally related visual regions. In addition, as the spatial grounding by the
attribute sub-query should be temporally consistent, we introduce the TAS
attribute sub-query should be temporally consistent, we introduce the TAS
strategy to assemble the predictions using the original video frames and the
temporal-augmented frames as inputs to help improve temporal consistency. We
evaluate our method on various MLLMs, and show that it outperforms SOTA methods
on three common STVG benchmarks.
The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
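To make the DSTH idea above more concrete, here is a minimal, self-contained PyTorch sketch of decomposing a query into attribute and action sub-queries and learning a latent visual prompt by pushing an MLLM's next-token logits toward a "yes" answer for each sub-query. It is a rough stand-in for the logit-guided re-attention described in the abstract, not the authors' implementation; `decompose_query`, `learn_prompt`, `logits_fn`, and `yes_token_id` are hypothetical names introduced here for illustration.

```python
# Minimal sketch of decomposed spatio-temporal highlighting (DSTH), assuming a
# differentiable `logits_fn` that maps a learnable prompt to next-token logits.
import torch
import torch.nn.functional as F


def decompose_query(query: str) -> tuple[str, str]:
    """Hypothetical split of the original query into attribute and action
    sub-queries (the paper derives these from the text query itself)."""
    # e.g. query = "the man in a red shirt who jumps over the fence"
    attribute_subquery = "Is there a man in a red shirt in this frame?"
    action_subquery = "Is the man jumping over the fence at this moment?"
    return attribute_subquery, action_subquery


def learn_prompt(logits_fn, num_tokens: int, dim: int, yes_token_id: int,
                 steps: int = 50, lr: float = 1e-2) -> torch.Tensor:
    """Learn a latent prompt by regularizing the model's next-token prediction
    toward a 'yes' answer for the sub-query. `logits_fn(prompt)` is assumed to
    be differentiable and to return logits of shape [vocab]."""
    prompt = torch.zeros(num_tokens, dim, requires_grad=True)
    optimizer = torch.optim.Adam([prompt], lr=lr)
    target = torch.tensor([yes_token_id])
    for _ in range(steps):
        logits = logits_fn(prompt)                        # [vocab]
        loss = F.cross_entropy(logits.unsqueeze(0), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return prompt.detach()


if __name__ == "__main__":
    # Dummy differentiable stand-in for a frozen MLLM head, for a quick run.
    vocab_size, num_tokens, dim = 32000, 4, 16
    W = torch.randn(num_tokens * dim, vocab_size)
    dummy_logits_fn = lambda p: p.flatten() @ W
    spatial_prompt = learn_prompt(dummy_logits_fn, num_tokens, dim,
                                  yes_token_id=9891)
    print(spatial_prompt.shape)  # torch.Size([4, 16])
```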
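Similarly, a minimal sketch of the temporal-augmented assembling (TAS) idea: score each frame using the original clip and simple temporally augmented views (reversal, 2x subsampling), map the scores back to the original timeline, and average them. `ground_fn` and the specific augmentations are assumptions for illustration, not necessarily the paper's exact recipe.

```python
# Minimal sketch of temporal-augmented assembling (TAS), assuming a `ground_fn`
# that returns per-frame relevance scores for a clip.
import torch


def assemble_predictions(frames: torch.Tensor, ground_fn) -> torch.Tensor:
    """`frames` is a [T, C, H, W] clip; `ground_fn(clip)` is assumed to return
    a per-frame relevance score tensor of shape [T']."""
    scores = ground_fn(frames).clone()                  # [T]
    votes = torch.ones_like(scores)

    # View 1: temporally reversed clip; flip its scores back to original order.
    rev_scores = ground_fn(torch.flip(frames, dims=[0]))
    scores += torch.flip(rev_scores, dims=[0])
    votes += 1

    # View 2: 2x temporal subsampling; scatter its scores onto the even frames.
    sub_scores = ground_fn(frames[::2])                 # [ceil(T/2)]
    scores[::2] += sub_scores
    votes[::2] += 1

    # Averaging across views damps per-frame flicker, improving consistency.
    return scores / votes


if __name__ == "__main__":
    clip = torch.rand(8, 3, 224, 224)
    dummy_ground_fn = lambda c: c.mean(dim=(1, 2, 3))   # stand-in frame scorer
    print(assemble_predictions(clip, dummy_ground_fn))  # 8 assembled scores
```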