
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

October 22, 2025
Authors: Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
cs.AI

Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this capability for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via an attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting to obtain fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
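
To make the described pipeline concrete, the following is a minimal NumPy sketch of two of the steps named in the abstract: attention rollout across transformer layers, and a contrastive object-vs-background fusion that thresholds the refined map into a coarse mask. All shapes, token layouts, and fusion formulas here are illustrative assumptions, not the authors' implementation; the complementary video-frame fusion and SAM2 prompting steps are omitted.

# Hypothetical sketch; shapes, token layout, and fusion formula are assumptions.
import numpy as np

def attention_rollout(layer_attns):
    """Combine per-layer, head-averaged attention matrices ([T, T] each)
    by adding a residual identity term and multiplying across layers."""
    T = layer_attns[0].shape[0]
    rollout = np.eye(T)
    for A in layer_attns:
        A_res = 0.5 * A + 0.5 * np.eye(T)            # account for residual connections
        A_res = A_res / A_res.sum(axis=-1, keepdims=True)
        rollout = A_res @ rollout
    return rollout

def query_to_visual_map(rollout, query_idx, visual_idx, frames, h, w):
    """Read rolled-out attention from the answer tokens to the visual tokens
    and reshape it into per-frame spatial maps."""
    attn = rollout[query_idx][:, visual_idx].mean(axis=0)   # average over answer tokens
    return attn.reshape(frames, h, w)

def contrastive_fusion(obj_map, bg_map, alpha=1.0):
    """Suppress background-like activations by subtracting a map obtained
    with a background/negative prompt (one possible formulation)."""
    fused = np.clip(obj_map - alpha * bg_map, 0.0, None)
    return fused / (fused.max() + 1e-6)

# Toy usage with random attention matrices, just to show the data flow.
T, L = 64, 4                        # tokens, layers
frames, h, w = 2, 4, 4              # 2 frames of 4x4 visual patches = 32 visual tokens
rng = np.random.default_rng(0)
attns = [a / a.sum(-1, keepdims=True) for a in (rng.random((T, T)) for _ in range(L))]

rollout = attention_rollout(attns)
visual_idx = np.arange(frames * h * w)              # assume visual tokens come first
query_idx = np.arange(T - 4, T)                     # assume the last 4 tokens are the answer

obj_map = query_to_visual_map(rollout, query_idx, visual_idx, frames, h, w)
bg_map = query_to_visual_map(rollout, query_idx, visual_idx, frames, h, w)  # in practice: from a background prompt
coarse_mask = contrastive_fusion(obj_map, bg_map) > 0.5    # threshold into a coarse mask
print(coarse_mask.shape)                            # (2, 4, 4)

In the paper's actual method, the coarse mask would then guide SAM2 prompting to recover fine-grained object masks.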