Vision Transformers with Self-Distilled Registers
May 27, 2025
Authors: Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
cs.AI
Abstract
Vision Transformers (ViTs) have emerged as the dominant architecture for
visual processing tasks, demonstrating excellent scalability with increased
training data and model size. However, recent work has identified the emergence
of artifact tokens in ViTs that are incongruous with the local semantics. These
anomalous tokens degrade ViT performance in tasks that require fine-grained
localization or structural coherence. An effective mitigation of this issue is
the addition of register tokens to ViTs, which implicitly "absorb" artifacts
during training. Given the availability of various large-scale pre-trained
ViTs, in this paper we aim to equip them with such register tokens without
retraining them from scratch, which is infeasible considering their size.
considering their size. Specifically, we propose Post Hoc Registers (PH-Reg),
an efficient self-distillation method that integrates registers into an
existing ViT without requiring additional labeled data and full retraining.
PH-Reg initializes both teacher and student networks from the same pre-trained
ViT. The teacher remains frozen and unmodified, while the student is augmented
with randomly initialized register tokens. By applying test-time augmentation
to the teacher's inputs, we generate denoised dense embeddings free of
artifacts, which are then used to optimize only a small subset of unlocked
student weights. We show that our approach can effectively reduce the number of
artifact tokens, improving the segmentation and depth prediction of the student
ViT under zero-shot and linear probing.
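The teacher-side denoising step described in the abstract — computing dense embeddings on augmented (shifted) inputs, realigning them, and averaging — can be illustrated with a toy sketch. This is not the authors' implementation; the names (`toy_teacher`, `denoised_targets`), the grid size, and the use of spatial shifts as the sole augmentation are all illustrative assumptions. The sketch only shows why the averaged output is artifact-free: content realigns exactly after the inverse shift, while artifacts fixed to absolute grid positions do not, so they average toward zero.

```python
import numpy as np

# Toy model of a ViT's dense feature map: features track image content, but
# artifact "spikes" appear at fixed grid positions regardless of the input.
# All names here are hypothetical; this is a sketch, not PH-Reg itself.

rng = np.random.default_rng(0)
GRID = 16                                    # 16x16 patch grid
content = rng.normal(size=(GRID, GRID))      # "true" dense signal per patch
artifact = np.zeros((GRID, GRID))
artifact[rng.integers(0, GRID, 8), rng.integers(0, GRID, 8)] = 5.0  # fixed spikes

def toy_teacher(image):
    """Frozen teacher: features follow content but add position-fixed artifacts."""
    return image + artifact

def denoised_targets(image, n_aug=32, max_shift=4):
    """Average teacher features over shifted inputs, realigning each output.

    The content term realigns exactly under the inverse shift; the
    position-fixed artifact term does not, so averaging suppresses it.
    """
    acc = np.zeros_like(image)
    for _ in range(n_aug):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        shifted = np.roll(image, (int(dy), int(dx)), axis=(0, 1))  # augment input
        feats = toy_teacher(shifted)                               # teacher pass
        acc += np.roll(feats, (-int(dy), -int(dx)), axis=(0, 1))   # realign output
    return acc / n_aug

err_noisy = np.abs(toy_teacher(content) - content).max()      # full spike height
err_clean = np.abs(denoised_targets(content) - content).max() # strongly reduced
print(err_noisy, err_clean)
```

In PH-Reg the resulting denoised dense embeddings then serve as regression targets for the student, whose added register tokens and a small subset of unlocked weights are optimized; that student-side distillation loop is not shown here.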