Vision Transformers Don't Need Trained Registers

June 9, 2025
Authors: Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
cs.AI

Abstract

We investigate the mechanism underlying a previously identified phenomenon in Vision Transformers -- the emergence of high-norm tokens that lead to noisy attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a sparse set of neurons is responsible for concentrating high-norm activations on outlier tokens, leading to irregular attention patterns and degrading downstream visual processing. While the existing solution for removing these outliers involves retraining models from scratch with additional learned register tokens, we use our findings to create a training-free approach to mitigate these artifacts. By shifting the high-norm activations from our discovered register neurons into an additional untrained token, we can mimic the effect of register tokens on a model already trained without registers. We demonstrate that our method produces cleaner attention and feature maps, enhances performance over base models across multiple downstream visual tasks, and achieves results comparable to models explicitly trained with register tokens. We then extend test-time registers to off-the-shelf vision-language models to improve their interpretability. Our results suggest that test-time registers effectively take on the role of register tokens at test time, offering a training-free solution for any pre-trained model released without them.
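
As a concrete illustration of the activation-shifting step described in the abstract, here is a minimal PyTorch sketch of a test-time register. The neuron indices, the tensor shapes, and the "move the peak activation" rule are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# Minimal sketch of a test-time register (hedged reconstruction).
# REGISTER_NEURONS below is a hypothetical placeholder for the sparse
# register-neuron indices the paper's analysis would identify.
import torch

REGISTER_NEURONS = [42, 137]  # hypothetical: indices of sparse register neurons

def add_test_time_register(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (batch, tokens, dim) activations inside a ViT block.

    Appends one zero-initialized (untrained) token, then shifts the
    register-neuron activations off the patch/CLS tokens and onto it.
    """
    b, t, d = hidden.shape
    register = hidden.new_zeros(b, 1, d)        # the extra, untrained token
    out = torch.cat([hidden, register], dim=1)  # (b, t + 1, d)
    # Collect the high-norm activations currently sitting on outlier tokens...
    peak = out[:, :t, REGISTER_NEURONS].amax(dim=1)
    out[:, :t, REGISTER_NEURONS] = 0.0
    # ...and concentrate them on the appended register token instead.
    out[:, t, REGISTER_NEURONS] = peak
    return out

x = torch.randn(2, 197, 3072)  # e.g. ViT-B/16: CLS + 196 patches, MLP width 3072
print(add_test_time_register(x).shape)  # torch.Size([2, 198, 3072])
```

In a full pipeline, a shift of this kind would be applied inside the blocks where the register neurons were found (e.g., via a forward hook), so the appended token absorbs the outlier activations before they disturb downstream attention maps.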