Vision Transformers Need Registers
September 28, 2023
Authors: Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski
cs.AI
Abstract
Transformers have recently emerged as a powerful tool for learning visual
representations. In this paper, we identify and characterize artifacts in
feature maps of both supervised and self-supervised ViT networks. The artifacts
correspond to high-norm tokens that appear during inference primarily in
low-informative background areas of images and are repurposed for internal
computations. We propose a simple yet effective solution based on providing
additional tokens to the input sequence of the Vision Transformer to fill that
role. We show that this solution fixes that problem entirely for both
supervised and self-supervised models, sets a new state of the art for
self-supervised visual models on dense visual prediction tasks, enables object
discovery methods with larger models, and most importantly leads to smoother
feature maps and attention maps for downstream visual processing.
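The proposed fix can be sketched in a few lines: learnable "register" tokens are concatenated to the patch-token sequence before the encoder and simply discarded at the output. A minimal shape-level sketch, assuming a generic ViT pipeline (the names `n_registers`, `forward`, and the toy dimensions are illustrative, not from the paper; the transformer blocks are stubbed out):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 196 patch tokens (a 14x14 grid) of width 768, as in ViT-B/16.
n_patches, dim = 196, 768
n_registers = 4  # extra learnable tokens added to the input sequence

# Learnable parameters (randomly initialized here for illustration).
cls_token = rng.standard_normal((1, dim))
registers = rng.standard_normal((n_registers, dim))

def forward(patch_tokens: np.ndarray):
    """Prepend [CLS] and register tokens, run the encoder, drop registers."""
    x = np.concatenate([cls_token, registers, patch_tokens], axis=0)
    # ... transformer blocks would process x here (identity as a stand-in) ...
    # The registers take over the "internal computation" role that high-norm
    # artifact tokens otherwise hijack from background patches; they carry no
    # output and are sliced away below.
    cls_out = x[0]
    patch_out = x[1 + n_registers:]
    return cls_out, patch_out

cls_out, patch_out = forward(rng.standard_normal((n_patches, dim)))
assert patch_out.shape == (n_patches, dim)  # feature-map shape is unchanged
```

Because the registers are dropped after the encoder, downstream dense-prediction heads see the same token grid as before; only the sequence length inside the transformer grows by `n_registers`.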