RefAM: Attention Magnets for Zero-Shot Referral Segmentation
September 26, 2025
Authors: Anna Kukleva, Enis Simsar, Alessio Tonioni, Muhammad Ferjad Naeem, Federico Tombari, Jan Eric Lenssen, Bernt Schiele
cs.AI
Abstract
Most existing approaches to referring segmentation achieve strong performance
only through fine-tuning or by composing multiple pre-trained models, often at
the cost of additional training and architectural modifications. Meanwhile,
large-scale generative diffusion models encode rich semantic information,
making them attractive as general-purpose feature extractors. In this work, we
introduce a new method that directly exploits attention scores from diffusion
transformers as features for downstream tasks, requiring neither architectural
modifications nor additional training. To systematically evaluate these
features, we extend benchmarks with vision-language grounding tasks spanning
both images and videos. Our key insight is that stop words act as attention
magnets: they accumulate surplus attention and can be filtered to reduce noise.
Moreover, we identify global attention sinks (GAS) emerging in deeper layers
and show that they can be safely suppressed or redirected onto auxiliary
tokens, leading to sharper and more accurate grounding maps. We further propose
an attention redistribution strategy, where appended stop words partition
background activations into smaller clusters, yielding sharper and more
localized heatmaps. Building on these findings, we develop RefAM, a simple
training-free grounding framework that combines cross-attention maps, GAS
handling, and redistribution. Across zero-shot referring image and video
segmentation benchmarks, our approach consistently outperforms prior methods,
establishing a new state of the art without fine-tuning or additional
components.
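
The abstract's core recipe, grounding via cross-attention maps with stop-word "attention magnets" filtered out, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: the tensor shapes, stop-word list, threshold, and the `grounding_map` helper are illustrative assumptions, and the GAS suppression and redistribution steps are omitted.

```python
# Minimal illustrative sketch (not the RefAM code): read a referring-expression
# heatmap off a cross-attention map after filtering stop-word "attention magnets".
# Shapes, stop-word list, and threshold are assumptions for illustration only.
import numpy as np

STOP_WORDS = {"the", "a", "an", "of", "on", "in", "to", "is"}  # assumed list

def grounding_map(cross_attn: np.ndarray, tokens: list[str], query: str) -> np.ndarray:
    """cross_attn: (H, W, T) attention from image patches to T text tokens."""
    # 1) Drop stop-word columns: per the abstract, they soak up surplus attention.
    keep = [i for i, t in enumerate(tokens) if t.lower() not in STOP_WORDS]
    attn = cross_attn[..., keep]
    kept_tokens = [tokens[i] for i in keep]

    # 2) Renormalize over the remaining tokens so each patch's scores sum to 1.
    attn = attn / (attn.sum(axis=-1, keepdims=True) + 1e-8)

    # 3) Read off the map for the query token and rescale it to [0, 1].
    heat = attn[..., kept_tokens.index(query)]
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)

# Toy usage: random scores stand in for real diffusion-transformer attention.
rng = np.random.default_rng(0)
tokens = ["the", "red", "car", "on", "the", "left"]
attn = rng.random((32, 32, len(tokens)))
mask = grounding_map(attn, tokens, "car") > 0.5  # threshold into a binary mask
print(mask.shape, mask.mean())
```

In the full method described above, this filtering would be combined with suppressing or redirecting global attention sinks and with appending extra stop words to redistribute background activation before thresholding.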