Vision Transformers Need Registers
September 28, 2023
Authors: Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski
cs.AI
Abstract
Transformers have recently emerged as a powerful tool for learning visual
representations. In this paper, we identify and characterize artifacts in
feature maps of both supervised and self-supervised ViT networks. The artifacts
correspond to high-norm tokens that appear during inference primarily in
low-informative background areas of images and are repurposed for internal
computations. We propose a simple yet effective solution based on providing
additional tokens to the input sequence of the Vision Transformer to fill that
role. We show that this solution fixes that problem entirely for both
supervised and self-supervised models, sets a new state of the art for
self-supervised visual models on dense visual prediction tasks, enables object
discovery methods with larger models, and most importantly leads to smoother
feature maps and attention maps for downstream visual processing.
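The proposed fix can be sketched in a few lines: learnable "register" tokens are concatenated to the patch-token sequence before the encoder and simply discarded at the output. A minimal shape-level sketch, assuming a generic ViT pipeline (the names `n_registers`, `forward`, and the toy dimensions are illustrative, not from the paper; the transformer blocks are stubbed out):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 196 patch tokens (a 14x14 grid) of width 768, as in ViT-B/16.
n_patches, dim = 196, 768
n_registers = 4  # extra learnable tokens added to the input sequence

# Learnable parameters (randomly initialized here for illustration).
cls_token = rng.standard_normal((1, dim))
registers = rng.standard_normal((n_registers, dim))

def forward(patch_tokens: np.ndarray):
    """Prepend [CLS] and register tokens, run the encoder, drop registers."""
    x = np.concatenate([cls_token, registers, patch_tokens], axis=0)
    # ... transformer blocks would process x here (identity as a stand-in) ...
    # The registers take over the "internal computation" role that high-norm
    # artifact tokens otherwise hijack from background patches; they carry no
    # output and are sliced away below.
    cls_out = x[0]
    patch_out = x[1 + n_registers:]
    return cls_out, patch_out

cls_out, patch_out = forward(rng.standard_normal((n_patches, dim)))
assert patch_out.shape == (n_patches, dim)  # feature-map shape is unchanged
```

Because the registers are dropped after the encoder, downstream dense-prediction heads see the same token grid as before; only the sequence length inside the transformer grows by `n_registers`.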