
Vision Transformers Need Registers

September 28, 2023
Authors: Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski
cs.AI

Abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, which are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
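The fix the abstract describes is mechanically simple: append a few extra learnable "register" tokens to the ViT input sequence, let them participate in attention, and discard them at the output. The PyTorch sketch below illustrates that idea under stated assumptions; the class name ViTWithRegisters, the toy linear patch embedding, and the choice of num_registers=4 are illustrative, not the authors' released implementation.

```python
# Minimal sketch of register tokens in a ViT, assuming a generic
# Transformer encoder. Names and hyperparameters are illustrative.
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_patches=196, num_registers=4):
        super().__init__()
        # A Linear stands in for the usual conv patch stem to keep
        # the sketch self-contained (16x16 RGB patches, flattened).
        self.patch_embed = nn.Linear(3 * 16 * 16, embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # The registers: extra learnable tokens that carry no patch
        # content, giving the model a place for internal computations
        # instead of hijacking low-informative background patches.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.num_registers = num_registers

    def forward(self, patches):  # patches: (B, N, 3*16*16)
        x = self.patch_embed(patches)
        B = x.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        # Registers are appended after positional encoding: they are
        # position-free and exist only inside the forward pass.
        regs = self.registers.expand(B, -1, -1)
        x = torch.cat([x, regs], dim=1)
        x = self.encoder(x)
        # Slice the register outputs off; downstream heads see only
        # the [CLS] token and the (now artifact-free) patch tokens.
        return x[:, : -self.num_registers]
```

For example, ViTWithRegisters()(torch.randn(2, 196, 3 * 16 * 16)) returns a (2, 197, 768) tensor: the registers absorb the high-norm "scratchpad" role during attention and are simply dropped afterwards, so the output interface of the model is unchanged.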