Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Abstract

Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal tube in a video that is specified by an input text query. In this paper, we use multimodal large language models (MLLMs) to explore a zero-shot solution to STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as grounding tokens, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding because they cannot fully integrate the cues in the text query (e.g., attributes, actions) during inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries, which ask whether the target is present spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts by regularizing the token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable, spatially and temporally relevant visual regions. In addition, because the spatial grounding produced by the attribute sub-query should be temporally consistent, we introduce the TAS strategy, which assembles the predictions obtained with the original video frames and with temporal-augmented frames as inputs to improve temporal consistency. We evaluate our method on various MLLMs and show that it outperforms state-of-the-art methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
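The assembling idea behind TAS can be illustrated with a minimal sketch. The abstract does not say how frames are augmented or how the two sets of predictions are combined, so everything below is an assumption made for illustration: the augmented input is taken to be the same frames in a different order (reversed, in the example), each pass is assumed to yield a score per candidate region per frame, and the two passes are combined by simple averaging. The names `assemble_scores` and `pick_candidates` are hypothetical and do not come from the released code.

```python
from typing import List, Sequence


def assemble_scores(
    original: Sequence[Sequence[float]],
    augmented: Sequence[Sequence[float]],
    augmented_order: Sequence[int],
) -> List[List[float]]:
    """Average per-frame candidate scores from two passes over the same video.

    original[t][k]     : score of candidate k in frame t (original order).
    augmented[i][k]    : score of candidate k at position i of the augmented input.
    augmented_order[i] : index of the original frame shown at position i.

    Averaging is an assumed combination rule, not the paper's stated one.
    """
    num_frames = len(original)
    if sorted(augmented_order) != list(range(num_frames)):
        raise ValueError("augmented_order must be a permutation of frame indices")

    # Map the augmented-pass predictions back to the original frame order.
    realigned: List[Sequence[float]] = [()] * num_frames
    for position, frame_idx in enumerate(augmented_order):
        realigned[frame_idx] = augmented[position]

    # Combine the two passes frame by frame, candidate by candidate.
    return [
        [(a + b) / 2.0 for a, b in zip(orig_row, aug_row)]
        for orig_row, aug_row in zip(original, realigned)
    ]


def pick_candidates(scores: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the highest-scoring candidate in each frame."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy example: three frames, two candidate regions per frame.
original_scores = [[0.6, 0.4], [0.4, 0.6], [0.7, 0.3]]
# Scores from a second pass over the frames in reversed order (2, 1, 0).
augmented_scores = [[0.8, 0.2], [0.8, 0.2], [0.5, 0.5]]

assembled = assemble_scores(original_scores, augmented_scores, [2, 1, 0])
print(pick_candidates(original_scores))
print(pick_candidates(assembled))
```

In this toy input the original pass switches candidates in the middle frame, while the assembled scores select the same candidate in every frame, which is the kind of temporal consistency the strategy is meant to encourage.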
