Vision Transformers Don't Need Trained Registers
June 9, 2025
作者: Nick Jiang, Amil Dravid, Alexei Efros, Yossi Gandelsman
cs.AI
Abstract
We investigate the mechanism underlying a previously identified phenomenon in
Vision Transformers -- the emergence of high-norm tokens that lead to noisy
attention maps. We observe that in multiple models (e.g., CLIP, DINOv2), a
sparse set of neurons is responsible for concentrating high-norm activations on
outlier tokens, leading to irregular attention patterns and degrading
downstream visual processing. While the existing solution for removing these
outliers involves retraining models from scratch with additional learned
register tokens, we use our findings to create a training-free approach to
mitigate these artifacts. By shifting the high-norm activations from our
discovered register neurons into an additional untrained token, we can mimic
the effect of register tokens on a model already trained without registers. We
demonstrate that our method produces cleaner attention and feature maps,
enhances performance over base models across multiple downstream visual tasks,
and achieves results comparable to models explicitly trained with register
tokens. We then extend test-time registers to off-the-shelf vision-language
models to improve their interpretability. Our results suggest that test-time
registers effectively take on the role of register tokens at test time,
offering a training-free solution for any pre-trained model released without
them.
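
The abstract describes the core mechanism only in prose: locate a sparse set of "register neurons" whose high-norm activations create outlier tokens, then redirect those activations into an extra, untrained token at inference time. The sketch below illustrates that idea under simplifying assumptions: it uses a toy MLP as a stand-in for a ViT block, and the helper names (`find_register_neurons`, `add_test_time_register`, `shift_to_register_hook`) are hypothetical, not the authors' released code or exact procedure.

```python
# A minimal sketch of a "test-time register", assuming the register neurons
# live in a block's MLP hidden activations and can be redirected with a
# forward hook. All names and the ToyMLP are illustrative.
import torch
import torch.nn as nn


class ToyMLP(nn.Module):
    """Stand-in for a ViT MLP so the sketch runs end to end."""
    def __init__(self, dim: int = 64, hidden: int = 256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


def find_register_neurons(hidden_acts: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Rank hidden units by their peak absolute activation over all tokens.

    hidden_acts: (batch, tokens, hidden) activations captured after the MLP
    nonlinearity. Returns indices of the k units with the largest peaks, a
    simple proxy for the sparse register neurons described in the abstract.
    """
    peak_per_unit = hidden_acts.abs().amax(dim=(0, 1))   # (hidden,)
    return torch.topk(peak_per_unit, k=k).indices


def add_test_time_register(tokens: torch.Tensor) -> torch.Tensor:
    """Append one extra, untrained (zero-initialized) token to hold outliers."""
    register = torch.zeros(tokens.shape[0], 1, tokens.shape[-1], device=tokens.device)
    return torch.cat([tokens, register], dim=1)


def shift_to_register_hook(register_neurons: torch.Tensor):
    """Forward hook: move the register neurons' activation mass from the
    patch tokens onto the last (test-time register) token."""
    def hook(module, inputs, output):
        out = output.clone()
        moved = out[:, :-1, register_neurons].sum(dim=1)  # collect high-norm mass
        out[:, -1, register_neurons] = moved              # park it on the register
        out[:, :-1, register_neurons] = 0.0               # and clear the patches
        return out
    return hook


if __name__ == "__main__":
    torch.manual_seed(0)
    mlp = ToyMLP()
    x = torch.randn(2, 197, 64)        # (batch, 1 CLS + 196 patch tokens, dim)

    # 1) Locate candidate register neurons from the hidden activations.
    hidden = mlp.act(mlp.fc1(x))
    reg_neurons = find_register_neurons(hidden, k=8)

    # 2) Add an untrained register token and redirect those neurons into it.
    x_reg = add_test_time_register(x)
    handle = mlp.act.register_forward_hook(shift_to_register_hook(reg_neurons))
    out = mlp(x_reg)                   # (2, 198, 64); the last token absorbs the outliers
    handle.remove()
    print(out.shape)
```

In a real ViT the same pattern would be applied at the specific layers and neuron indices identified for that model (e.g., CLIP or DINOv2), leaving all trained weights untouched; only the extra token and the activation redirection are added at inference time.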