Inherently Faithful Attention Maps for Vision Transformers
June 10, 2025
Authors: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
cs.AI
Abstract
We introduce an attention-based method that uses learned binary attention
masks to ensure that only attended image regions influence the prediction.
Context can strongly affect object perception, sometimes leading to biased
representations, particularly when objects appear in out-of-distribution
backgrounds. At the same time, many image-level object-centric tasks require
identifying relevant regions, often requiring context. To address this
conundrum, we propose a two-stage framework: stage 1 processes the full image
to discover object parts and identify task-relevant regions, while stage 2
leverages input attention masking to restrict its receptive field to these
regions, enabling a focused analysis while filtering out potentially spurious
information. Both stages are trained jointly, allowing stage 2 to refine the
output of stage 1. Extensive experiments across diverse benchmarks demonstrate that our
approach significantly improves robustness against spurious correlations and
out-of-distribution backgrounds.
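The key mechanism in stage 2 is that masked-out tokens cannot contribute to the prediction at all. The sketch below is a minimal NumPy illustration of this idea, not the paper's implementation: it shows key-side binary attention masking in a single attention head, where blocking a token's logits with `-inf` guarantees its features have zero influence on the output. The function name and the single-head, unbatched shapes are illustrative assumptions.

```python
import numpy as np

def masked_attention(q, k, v, keep_mask):
    """Single-head attention where tokens with keep_mask == 0 cannot be
    attended to, so their keys/values cannot influence any output token.

    q, k, v: (n_tokens, d) arrays; keep_mask: (n_tokens,) binary array
    with at least one entry equal to 1.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (n, n) attention logits
    # Block masked tokens as keys: their columns become -inf before softmax.
    scores = np.where(keep_mask[None, :] == 1, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # masked columns get weight 0
    return weights @ v

# Faithfulness check: perturbing a masked token leaves the output unchanged.
rng = np.random.default_rng(0)
n, d = 4, 8
q, k, v = rng.normal(size=(3, n, d))
mask = np.array([1, 1, 0, 1])
out = masked_attention(q, k, v, mask)
v_perturbed = v.copy()
v_perturbed[2] += 100.0          # change the masked token's value vector
out_perturbed = masked_attention(q, k, v_perturbed, mask)
assert np.allclose(out, out_perturbed)
```

Because the mask acts on the attention logits rather than post-hoc on an attention map, the exclusion of masked regions is exact by construction, which is the sense in which such attention maps are inherently faithful.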