Vision Transformers Need Registers

Abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in the feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens that appear during inference, primarily in low-informative background areas of images, and that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes the problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
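
The bookkeeping behind the proposed fix can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the token ordering, the zero-initialized registers, the identity stand-in for the encoder, and the norm threshold are all assumptions made for illustration. It assumes the extra tokens are appended to the sequence before the encoder and dropped from its output, and it shows one simple way to flag high-norm tokens.

```python
import math
import random


def add_registers(cls_token, patch_tokens, register_tokens):
    """Build the encoder input as [CLS] + patch tokens + register tokens.

    The ordering is an arbitrary choice for this sketch.
    """
    return [cls_token] + list(patch_tokens) + list(register_tokens)


def split_outputs(output_tokens, num_patches, num_registers):
    """Split encoder outputs into CLS, patch, and register parts."""
    assert len(output_tokens) == 1 + num_patches + num_registers
    cls_out = output_tokens[0]
    patch_out = output_tokens[1:1 + num_patches]
    register_out = output_tokens[1 + num_patches:]
    return cls_out, patch_out, register_out


def l2_norm(token):
    """Euclidean norm of one token vector."""
    return math.sqrt(sum(x * x for x in token))


def high_norm_indices(tokens, threshold):
    """Indices of tokens whose norm exceeds a hypothetical threshold."""
    return [i for i, t in enumerate(tokens) if l2_norm(t) > threshold]


rng = random.Random(0)
dim, num_patches, num_registers = 8, 16, 4

cls = [0.0] * dim
patches = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
# In a real model these would be learnable parameters; zeros are a placeholder.
registers = [[0.0] * dim for _ in range(num_registers)]

seq = add_registers(cls, patches, registers)

# Identity stand-in for the transformer encoder (assumption for this sketch).
outputs = seq

# Only the CLS and patch outputs are kept for downstream use.
cls_out, patch_out, _ = split_outputs(outputs, num_patches, num_registers)
```

In this sketch the register outputs are simply ignored after the split, so downstream code sees the same number of patch tokens as a model without the extra tokens.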
