

RefAM: Attention Magnets for Zero-Shot Referral Segmentation

September 26, 2025
作者: Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
cs.AI

Abstract

Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features (attention scores) from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered out to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy, where appended stop words partition background activations into smaller clusters, yielding sharper and more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
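To make the pipeline described above concrete, here is a minimal sketch of the training-free grounding idea: take per-token cross-attention maps, drop stop-word tokens (the "attention magnets"), suppress anomalously strong spatial positions as a stand-in for global attention sink (GAS) handling, and average the remaining maps into a grounding heatmap. The stop-word list, the quantile-based sink detection, and all function names are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Assumed stop-word list; the paper's exact filtering set is not specified here.
STOP_WORDS = {"the", "a", "an", "of", "on", "in"}

def grounding_heatmap(attn, tokens, sink_quantile=0.99):
    """Sketch of attention-based grounding.

    attn:   (num_tokens, H, W) cross-attention maps from a diffusion transformer.
    tokens: list of prompt tokens aligned with the first axis of `attn`.
    """
    # 1) Filter stop-word tokens, which accumulate surplus attention.
    keep = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
    maps = attn[keep]

    # 2) Crude GAS suppression (assumption): zero out spatial positions whose
    #    total attention across tokens is anomalously high.
    total = maps.sum(axis=0)
    sink_mask = total > np.quantile(total, sink_quantile)
    maps = np.where(sink_mask[None], 0.0, maps)

    # 3) Aggregate the remaining maps and normalize to [0, 1].
    heat = maps.mean(axis=0)
    rng = heat.max() - heat.min()
    return (heat - heat.min()) / rng if rng > 0 else heat

# Toy usage with random attention maps in place of real model features.
rng = np.random.default_rng(0)
attn = rng.random((5, 8, 8)).astype(np.float32)
tokens = ["the", "red", "car", "on", "left"]
heat = grounding_heatmap(attn, tokens)
mask = heat > 0.5  # threshold the heatmap into a binary segmentation mask
```

A real pipeline would extract `attn` from the model's cross-attention layers and refine the thresholded mask; this toy version only shows the filter-suppress-aggregate structure.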
September 29, 2025