Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
September 18, 2025
Authors: Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau
cs.AI
Abstract
Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal tube in a video that is specified by an input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution to STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as grounding tokens, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to their inability to fully integrate the cues in the text query (e.g., attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel
decomposed spatio-temporal highlighting (DSTH) and temporal-augmented
assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH
strategy first decouples the original query into attribute and action sub-queries that probe the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to
learn latent variables as spatial and temporal prompts, by regularizing token
predictions for each sub-query. These prompts highlight attribute and action
cues, respectively, directing the model's attention to reliable, spatially and temporally relevant visual regions. In addition, as the spatial grounding by the
attribute sub-query should be temporally consistent, we introduce the TAS
strategy to assemble the predictions using the original video frames and the
temporal-augmented frames as inputs to help improve temporal consistency. We
evaluate our method with various MLLMs and show that it outperforms state-of-the-art (SOTA) methods on three common STVG benchmarks.
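To make the decomposed spatio-temporal highlighting (DSTH) and logit-guided re-attention (LRA) descriptions above more concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's implementation: `decompose_query`, `mllm_yes_logit`, and the example sub-queries are hypothetical stand-ins, and the "yes"-logit objective is just one plausible way to turn token-prediction regularization into a learned latent prompt.

```python
# Hedged sketch: assumes a differentiable, hypothetical scoring function
# mllm_yes_logit(visual_feats, question) returning the MLLM's logit for answering
# "yes" to a sub-query; the example sub-queries below are made up for illustration.
import torch


def decompose_query(query: str) -> dict:
    """Toy stand-in for decomposing a query into attribute and action sub-queries
    (the paper presumably derives these from the original query with the MLLM)."""
    return {
        "attribute": "Is there a man wearing a white shirt in this frame?",
        "action": "Is the man wearing a white shirt jumping in this clip?",
    }


def learn_latent_prompt(visual_feats, sub_query, mllm_yes_logit, steps=50, lr=1e-2):
    """Optimize an additive latent prompt so the model answers the sub-query with
    higher confidence, loosely mirroring the logit-guided re-attention (LRA) idea."""
    prompt = torch.zeros_like(visual_feats, requires_grad=True)
    optimizer = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        logit = mllm_yes_logit(visual_feats + prompt, sub_query)
        loss = torch.nn.functional.softplus(-logit)  # push the "yes" logit upward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return prompt.detach()


# Usage sketch: learn one prompt per sub-query, then ground with highlighted features.
# sub_queries = decompose_query("the man in a white shirt who jumps over the bench")
# spatial_prompt = learn_latent_prompt(feats, sub_queries["attribute"], mllm_yes_logit)
# temporal_prompt = learn_latent_prompt(feats, sub_queries["action"], mllm_yes_logit)
```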
The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
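A similarly hedged sketch of the temporal-augmented assembling (TAS) idea, which assembles predictions from the original frames and temporally augmented frames to improve temporal consistency. Again, `grounding_fn` and the permutation-based augmentation are assumptions chosen for illustration, not components taken from the paper or its repository.

```python
# Hedged sketch: grounding_fn(frames) is a hypothetical callable returning a (T,)
# tensor of per-frame relevance scores for the attribute sub-query; a random
# permutation of the frames stands in for whatever temporal augmentation the
# paper actually applies.
import torch


def temporal_augmented_assemble(frames, grounding_fn, num_augs=2, seed=0):
    """Average per-frame scores from the original frame order and from temporally
    augmented orderings, mapping augmented scores back to original frame indices."""
    generator = torch.Generator().manual_seed(seed)
    num_frames = frames.shape[0]
    scores = grounding_fn(frames).clone()
    for _ in range(num_augs):
        perm = torch.randperm(num_frames, generator=generator)
        augmented_scores = grounding_fn(frames[perm])
        restored = torch.empty_like(augmented_scores)
        restored[perm] = augmented_scores  # undo the permutation
        scores += restored
    return scores / (num_augs + 1)
```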