Vision Transformers Don't Need Trained Registers
June 9, 2025
作者: Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
cs.AI
Abstract
We investigate the mechanism underlying a previously identified phenomenon in
Vision Transformers -- the emergence of high-norm tokens that lead to noisy
attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a
sparse set of neurons is responsible for concentrating high-norm activations on
outlier tokens, leading to irregular attention patterns and degrading
downstream visual processing. While the existing solution for removing these
outliers involves retraining models from scratch with additional learned
register tokens, we use our findings to create a training-free approach to
mitigate these artifacts. By shifting the high-norm activations from our
discovered register neurons into an additional untrained token, we can mimic
the effect of register tokens on a model already trained without registers. We
demonstrate that our method produces cleaner attention and feature maps,
enhances performance over base models across multiple downstream visual tasks,
and achieves results comparable to models explicitly trained with register
tokens. We then extend test-time registers to off-the-shelf vision-language
models to improve their interpretability. Our results suggest that test-time
registers effectively take on the role of register tokens at test time,
offering a training-free solution for any pre-trained model released without
them.
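
The abstract describes the core mechanism only in prose: locate a sparse set of "register neurons" whose high-norm activations create outlier tokens, then redirect those activations into an extra, untrained token at inference time. The sketch below illustrates that idea under simplifying assumptions: it uses a toy MLP as a stand-in for a ViT block, and the helper names (`find_register_neurons`, `add_test_time_register`, `shift_to_register_hook`) are hypothetical, not the authors' released code or exact procedure.

```python
# A minimal sketch of a "test-time register", assuming the register neurons
# live in a block's MLP hidden activations and can be redirected with a
# forward hook. All names and the ToyMLP are illustrative.
import torch
import torch.nn as nn


class ToyMLP(nn.Module):
    """Stand-in for a ViT MLP so the sketch runs end to end."""
    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


def find_register_neurons(hidden_acts: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Rank hidden units by their peak absolute activation over all tokens.

    hidden_acts: (batch, tokens, hidden) activations captured after the MLP
    nonlinearity. Returns indices of the k units with the largest peaks, a
    simple proxy for the sparse register neurons described in the abstract.
    """
    peak_per_unit = hidden_acts.abs().amax(dim=(0, 1))   # (hidden,)
    return torch.topk(peak_per_unit, k=k).indices


def add_test_time_register(tokens: torch.Tensor) -> torch.Tensor:
    """Append one extra, untrained (zero-initialized) token to hold outliers."""
    register = torch.zeros(tokens.shape[0], 1, tokens.shape[-1], device=tokens.device)
    return torch.cat([tokens, register], dim=1)


def shift_to_register_hook(register_neurons: torch.Tensor):
    """Forward hook: move the register neurons' activation mass from the
    patch tokens onto the last (test-time register) token."""
    def hook(module, inputs, output):
        out = output.clone()
        moved = out[:, :-1, register_neurons].sum(dim=1)  # collect high-norm mass
        out[:, -1, register_neurons] = moved              # park it on the register
        out[:, :-1, register_neurons] = 0.0               # and clear the patches
        return out
    return hook


if __name__ == "__main__":
    torch.manual_seed(0)
    mlp = ToyMLP()
    x = torch.randn(2, 197, 64)        # (batch, 1 CLS + 196 patch tokens, dim)

    # 1) Locate candidate register neurons from the hidden activations.
    hidden = mlp.act(mlp.fc1(x))
    reg_neurons = find_register_neurons(hidden, k=8)

    # 2) Add an untrained register token and redirect those neurons into it.
    x_reg = add_test_time_register(x)
    handle = mlp.act.register_forward_hook(shift_to_register_hook(reg_neurons))
    out = mlp(x_reg)                   # (2, 198, 64); the last token absorbs the outliers
    handle.remove()
    print(out.shape)
```

In a real ViT the same pattern would be applied at the specific layers and neuron indices identified for that model (e.g., CLIP or DINOv2), leaving all trained weights untouched; only the extra token and the activation redirection are added at inference time.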