

Vision Transformers with Self-Distilled Registers

May 27, 2025
Authors: Yinjie Chen, Zipeng Yan, Chong Zhou, Bo Dai, Andrew F. Luo
cs.AI

Abstract

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifacts during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim to equip them with such register tokens without retraining them from scratch, which is infeasible given their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach effectively reduces the number of artifact tokens, improving the segmentation and depth prediction performance of the student ViT under zero-shot and linear probing.
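The test-time-augmentation denoising step described above can be illustrated with a toy stand-in. The sketch below is a minimal NumPy illustration, not the paper's implementation: the frozen teacher is modeled as a per-token linear map plus position-locked artifact spikes, and cyclic sequence shifts serve as a hypothetical augmentation. Because the per-token features are shift-equivariant while the spikes are tied to position, averaging the un-shifted outputs dilutes the artifacts; in PH-Reg, such denoised targets would then supervise the register-augmented student. All names (`teacher_forward`, `denoised_targets`) and the spike model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 28, 8                       # toy token count and feature dimension
W_pre = rng.normal(size=(D, D))    # stand-in for frozen pretrained weights

def clean_features(x):
    # Artifact-free dense features. Unknown to the method in practice;
    # used here only to measure how well the denoising works.
    return x @ W_pre

def teacher_forward(x):
    # Frozen, unmodified teacher: clean features plus high-magnitude
    # spikes at fixed token positions, mimicking ViT artifact tokens.
    spikes = (np.arange(x.shape[0]) % 7 == 0)[:, None].astype(float)
    return clean_features(x) + 5.0 * spikes

def denoised_targets(x, n_aug=64):
    # Test-time augmentation: shift the token sequence, run the frozen
    # teacher, undo the shift, and average. Features are preserved,
    # while the position-locked spikes land on different tokens each
    # pass and are spread thin by the averaging.
    acc = np.zeros((x.shape[0], W_pre.shape[1]))
    for _ in range(n_aug):
        s = int(rng.integers(x.shape[0]))
        acc += np.roll(teacher_forward(np.roll(x, s, axis=0)), -s, axis=0)
    return acc / n_aug

x = rng.normal(size=(N, D))
raw_err = np.abs(teacher_forward(x) - clean_features(x)).max()
den_err = np.abs(denoised_targets(x) - clean_features(x)).max()
print(f"max per-token artifact error: raw={raw_err:.2f} denoised={den_err:.2f}")
```

In this toy, the worst-case per-token deviation from the clean features drops sharply after shift-averaging, which is the property PH-Reg exploits to obtain artifact-free supervision targets from an unmodified frozen teacher.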

