Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To adapt this ability directly to localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms other training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
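To make the pipeline described above more concrete, the following is a minimal sketch and not the DecAF implementation. The abstract does not give formulas, so everything here is an assumption for illustration: `attention_rollout` follows the generic attention-rollout formulation (averaging each layer's attention with the identity and multiplying across layers), and `contrastive_fusion` uses a simple normalized subtraction of a background map from an object map as a stand-in for contrastive object-background fusion. All function names, the threshold, and the toy inputs are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Generic attention rollout (assumed formulation, not taken from the paper).

    attentions: list of (tokens, tokens) head-averaged, row-stochastic arrays,
    ordered from the first layer to the last.
    """
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in attentions:
        # Mix in the identity to account for residual connections.
        a = 0.5 * attn + 0.5 * np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    return rollout


def minmax(x, eps=1e-8):
    """Rescale an array to approximately [0, 1]."""
    return (x - x.min()) / (x.max() - x.min() + eps)


def contrastive_fusion(obj_map, bg_map):
    """Hypothetical fusion: subtract background attention from object attention.

    Both maps are normalized first; negative responses are clipped to zero.
    """
    fused = np.clip(minmax(obj_map) - minmax(bg_map), 0.0, None)
    return minmax(fused)


def coarse_mask(attn_map, threshold=0.5):
    """Binarize a fused attention map (the threshold is an arbitrary choice)."""
    return attn_map >= threshold


# Toy 2x2 maps: the top-right cell is noisy in the object map
# but strongly activated by the background query, so it is suppressed.
obj_map = np.array([[1.0, 0.6], [0.6, 0.0]])
bg_map = np.array([[0.0, 1.0], [0.2, 0.0]])
mask = coarse_mask(contrastive_fusion(obj_map, bg_map))
```

In this toy case only the top-left cell survives thresholding. A coarse mask of this kind could then be turned into point or box prompts for a segmentation model such as SAM2; that prompting step is not sketched here because the abstract gives no details about it.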
