Inherently Faithful Attention Maps for Vision Transformers

Abstract

We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, which itself often depends on context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
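The input attention masking described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`stage1_select`, `masked_self_attention`, `stage2_predict`), the linear patch scorer, the threshold, the single attention head, and the mean pooling are all assumptions made for illustration, and joint training of the two stages is omitted. The sketch only shows the core property that, once a binary mask is fixed, patches outside the mask cannot affect the stage-2 output.

```python
import numpy as np

def stage1_select(tokens, w_score, threshold=0.0):
    """Hypothetical stage 1: score every patch of the full image and binarize."""
    scores = tokens @ w_score
    keep = scores > threshold
    if not keep.any():
        keep[np.argmax(scores)] = True  # always keep at least one patch
    return keep

def masked_self_attention(tokens, keep, w_q, w_k, w_v):
    """Single-head self-attention in which masked tokens are never used as keys.

    tokens: (n, d) patch embeddings; keep: (n,) boolean mask, True = kept region.
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Key columns of masked tokens get -inf, so their softmax weight is exactly 0.
    scores = np.where(keep[None, :], scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def stage2_predict(tokens, keep, params):
    """Hypothetical stage 2: masked attention, then pooling over kept tokens only."""
    w_q, w_k, w_v, w_out = params
    attended = masked_self_attention(tokens, keep, w_q, w_k, w_v)
    # Pool only over kept tokens so masked positions cannot leak into the logits.
    pooled = attended[keep].mean(axis=0)
    return pooled @ w_out

rng = np.random.default_rng(0)
n, d, c = 16, 8, 3  # patches, embedding width, classes (arbitrary toy sizes)
tokens = rng.normal(size=(n, d))
params = tuple(rng.normal(size=s) for s in [(d, d), (d, d), (d, d), (d, c)])
keep = stage1_select(tokens, rng.normal(size=d))
logits = stage2_predict(tokens, keep, params)

# Perturb only the masked-out patches and rerun stage 2 with the same mask.
perturbed = tokens.copy()
perturbed[~keep] += 10.0 * rng.normal(size=(int((~keep).sum()), d))
logits_perturbed = stage2_predict(perturbed, keep, params)
print(np.allclose(logits, logits_perturbed))
```

The invariance in this sketch holds for a fixed mask. In the two-stage setting described in the abstract, stage 1 still sees the full image, so a changed background could in principle change the mask itself.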
