

Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

October 22, 2025
作者: Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
cs.AI

Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via a rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
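To make the two fusion steps described above concrete, here is a minimal sketch, not the authors' implementation: it assumes rollout attention maps have already been extracted from the MLLM, and the function names, the simple subtraction/blending arithmetic, and the thresholds are illustrative assumptions only.

```python
# Minimal sketch (illustrative, not the DecAF codebase) of contrastive
# object-background fusion, complementary video-frame fusion, and converting
# the fused attention into coarse masks plus point prompts for SAM2.
import numpy as np

def contrastive_object_background_fusion(obj_attn, bg_attn):
    """Suppress activations shared with a background query.

    obj_attn, bg_attn: (T, H, W) rollout attention maps for the target-object
    query and a generic background query, values assumed in [0, 1].
    """
    fused = np.clip(obj_attn - bg_attn, 0.0, None)
    # Renormalize each frame so maps stay comparable across frames.
    denom = fused.reshape(fused.shape[0], -1).max(axis=1).reshape(-1, 1, 1) + 1e-8
    return fused / denom

def complementary_video_frame_fusion(video_attn, frame_attn, alpha=0.5):
    """Blend video-level and per-frame attention cues (alpha is an assumed weight)."""
    return alpha * video_attn + (1.0 - alpha) * frame_attn

def coarse_masks_and_point_prompts(attn, threshold=0.4):
    """Binarize fused attention into coarse masks and take the per-frame
    attention peak as a point prompt that could be passed to SAM2."""
    masks = attn > threshold
    points = [np.unravel_index(int(np.argmax(a)), a.shape) for a in attn]  # (y, x) per frame
    return masks, points

# Toy usage with random maps standing in for rollout outputs.
T, H, W = 4, 24, 24
rng = np.random.default_rng(0)
obj, bg = rng.random((T, H, W)), rng.random((T, H, W))
video_level = contrastive_object_background_fusion(obj, bg)
frame_level = contrastive_object_background_fusion(obj, 0.5 * bg)
fused = complementary_video_frame_fusion(video_level, frame_level)
masks, prompts = coarse_masks_and_point_prompts(fused)
print(masks.shape, prompts[0])
```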