RefAM: Attention Magnets for Zero-Shot Referral Segmentation

Abstract

Most existing approaches to referring segmentation achieve strong performance only through fine-tuning or by composing multiple pre-trained models, often at the cost of additional training and architectural modifications. Meanwhile, large-scale generative diffusion models encode rich semantic information, making them attractive as general-purpose feature extractors. In this work, we introduce a new method that directly exploits features and attention scores from diffusion transformers for downstream tasks, requiring neither architectural modifications nor additional training. To systematically evaluate these features, we extend benchmarks with vision-language grounding tasks spanning both images and videos. Our key insight is that stop words act as attention magnets: they accumulate surplus attention and can be filtered to reduce noise. Moreover, we identify global attention sinks (GAS) emerging in deeper layers and show that they can be safely suppressed or redirected onto auxiliary tokens, leading to sharper and more accurate grounding maps. We further propose an attention redistribution strategy in which appended stop words partition background activations into smaller clusters, yielding more localized heatmaps. Building on these findings, we develop RefAM, a simple training-free grounding framework that combines cross-attention maps, GAS handling, and redistribution. Across zero-shot referring image and video segmentation benchmarks, our approach consistently outperforms prior methods, establishing a new state of the art without fine-tuning or additional components.
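
The stop-word filtering idea can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the function name `grounding_heatmap`, and the flattened per-token attention layout are all assumptions made for illustration, and GAS handling and redistribution are not modelled here. The sketch only shows how dropping stop-word tokens before averaging cross-attention maps removes their diffuse background contribution.

```python
# Hypothetical stop-word list; the abstract does not specify the one used.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "to", "is"}


def grounding_heatmap(tokens, attn_maps):
    """Average per-token attention maps, skipping stop-word tokens.

    tokens:    list of prompt tokens (strings)
    attn_maps: one flattened list of per-pixel attention scores per token
               (assumed layout, chosen to keep the sketch dependency-free)
    """
    if len(tokens) != len(attn_maps):
        raise ValueError("tokens and attn_maps must have the same length")

    # Keep only maps of content tokens; stop words are treated as
    # "attention magnets" and filtered out.
    kept = [m for t, m in zip(tokens, attn_maps) if t.lower() not in STOP_WORDS]
    if not kept:
        raise ValueError("no content tokens left after filtering")

    n_pixels = len(kept[0])
    mean = [sum(m[i] for m in kept) / len(kept) for i in range(n_pixels)]

    # Min-max normalise to [0, 1]; a completely flat map becomes all zeros.
    lo, hi = min(mean), max(mean)
    span = hi - lo
    return [(v - lo) / span if span > 0 else 0.0 for v in mean]


# Toy example: "the" spreads attention over the background,
# while "red" and "car" both peak at pixel index 1.
tokens = ["the", "red", "car"]
attn_maps = [
    [0.9, 0.8, 0.9, 0.7],
    [0.1, 0.6, 0.1, 0.2],
    [0.0, 0.8, 0.1, 0.0],
]
heatmap = grounding_heatmap(tokens, attn_maps)
```

In this toy input the filtered heatmap peaks at the pixel that the content tokens agree on. A real pipeline would read the maps from the cross-attention layers of a diffusion transformer and threshold the heatmap to obtain a mask.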
