Inherently Faithful Attention Maps for Vision Transformers
June 10, 2025
Authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
cs.AI
Abstract
We introduce an attention-based method that uses learned binary attention
masks to ensure that only attended image regions influence the prediction.
Context can strongly affect object perception, sometimes leading to biased
representations, particularly when objects appear in out-of-distribution
backgrounds. At the same time, many image-level object-centric tasks require
identifying relevant regions, often requiring context. To address this
conundrum, we propose a two-stage framework: stage 1 processes the full image
to discover object parts and identify task-relevant regions, while stage 2
leverages input attention masking to restrict its receptive field to these
regions, enabling a focused analysis while filtering out potentially spurious
information. Both stages are trained jointly, allowing stage 2 to refine the
output of stage 1. Extensive experiments across diverse benchmarks demonstrate that our
approach significantly improves robustness against spurious correlations and
out-of-distribution backgrounds.
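The key mechanism in stage 2 is that masked-out tokens cannot contribute to the prediction at all. The sketch below is a minimal NumPy illustration of this idea, not the paper's implementation: it shows key-side binary attention masking in a single attention head, where blocking a token's logits with `-inf` guarantees its features have zero influence on the output. The function name and the single-head, unbatched shapes are illustrative assumptions.

```python
import numpy as np

def masked_attention(q, k, v, keep_mask):
    """Single-head attention where tokens with keep_mask == 0 cannot be
    attended to, so their keys/values cannot influence any output token.

    q, k, v: (n_tokens, d) arrays; keep_mask: (n_tokens,) binary array
    with at least one entry equal to 1.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (n, n) attention logits
    # Block masked tokens as keys: their columns become -inf before softmax.
    scores = np.where(keep_mask[None, :] == 1, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # masked columns get weight 0
    return weights @ v

# Faithfulness check: perturbing a masked token leaves the output unchanged.
rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = rng.normal(size=(3, n, d))
mask = np.array([1, 1, 0, 1])
out = masked_attention(q, k, v, mask)
v_perturbed = v.copy()
v_perturbed[2] += 100.0          # change the masked token's value vector
out_perturbed = masked_attention(q, k, v_perturbed, mask)
assert np.allclose(out, out_perturbed)
```

Because the mask acts on the attention logits rather than post-hoc on an attention map, the exclusion of masked regions is exact by construction, which is the sense in which such attention maps are inherently faithful.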