Inherently Faithful Attention Maps for Vision Transformers

Abstract

We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, which itself often requires context. To address this conundrum, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 uses input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
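To make the idea of input attention masking concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single attention head, a single query (standing in for a class token), and a hand-written binary mask in place of the mask that stage 1 would learn. The function name `masked_attention_pool` and all values are hypothetical. It illustrates the property the abstract describes: regions outside the mask receive exactly zero attention weight, so changing them cannot change the output.

```python
import math


def masked_attention_pool(query, keys, values, mask):
    """Softmax attention for one query, restricted to positions where mask[i] is 1.

    Hypothetical helper for illustration only: one head, one query, no learned
    projections. Returns the pooled vector and the attention weights.
    """
    if not any(mask):
        raise ValueError("mask must keep at least one position")
    scale = 1.0 / math.sqrt(len(query))
    # Scaled dot-product scores; masked-out positions get -inf so that their
    # softmax weight is exactly zero.
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else float("-inf")
        for key, keep in zip(keys, mask)
    ]
    top = max(scores)  # finite, because at least one position is kept
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


# Four toy "patch" embeddings; the mask keeps the first two (assumed object
# region) and drops the last two (assumed background).
patches = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
mask = [1, 1, 0, 0]
query = [0.5, -0.5]

out_a, w_a = masked_attention_pool(query, patches, patches, mask)

# Replace the masked-out "background" patches with different content.
altered = [patches[0], patches[1], [100.0, -100.0], [7.0, 7.0]]
out_b, w_b = masked_attention_pool(query, altered, altered, mask)

print(w_a)             # the last two weights are 0.0
print(out_a == out_b)  # True: masked-out patches do not affect the output
```

In a full vision transformer the same restriction would presumably have to hold in every layer of the second stage, since a masked-out token that leaks into any layer could still influence the prediction indirectly.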
