

Inherently Faithful Attention Maps for Vision Transformers

June 10, 2025
作者: Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos
cs.AI

Abstract

We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, often requiring context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
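The central guarantee described in the abstract, that only attended image regions can influence the prediction, can be illustrated with a toy self-attention layer in which a learned binary mask zeroes out attention to unattended tokens. This is a minimal NumPy sketch of the general masking idea, not the authors' implementation; the function name and the example mask are hypothetical:

```python
import numpy as np

def masked_attention(q, k, v, region_mask):
    """Single-head attention where keys/values outside the binary
    region_mask (1 = attended token) receive zero attention weight."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Masked-out columns get -inf logits, hence exactly zero softmax weight.
    scores = np.where(region_mask[None, :] == 1, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
q, k, v = (rng.standard_normal((n_tokens, dim)) for _ in range(3))

# Stage 1 analogue: full receptive field (all-ones mask).
full = masked_attention(q, k, v, np.ones(n_tokens, dtype=int))
# Stage 2 analogue: restrict attention to a discovered region
# (hypothetical stage-1 output).
region = np.array([1, 1, 0, 0, 1, 0])
focused = masked_attention(q, k, v, region)
```

Because the masked logits are exactly -inf before the softmax, perturbing the values of masked-out tokens provably leaves the focused output unchanged, which is the faithfulness property the abstract claims for its binary attention masks.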
PDF · June 16, 2025