Vision Transformers with Self-Distilled Registers

Abstract

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim to equip them with such register tokens without re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both the teacher and the student network from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach can effectively reduce the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear-probing evaluation.
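The teacher-side denoising and the student-side registers described above can be illustrated with a minimal, standard-library-only sketch. It is not the PH-Reg implementation: `toy_teacher` is a hypothetical stand-in for a frozen ViT that emits one scalar feature per patch and injects a position-bound artifact, the choice of flips as the augmentation set is an assumption made for illustration, and all function names are invented here.

```python
import random


def hflip(grid):
    """Flip a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def vflip(grid):
    """Flip a 2D grid top to bottom."""
    return grid[::-1]


def hvflip(grid):
    """Flip a 2D grid along both axes."""
    return hflip(vflip(grid))


def identity(grid):
    return grid


def toy_teacher(image):
    """Hypothetical frozen teacher: one scalar feature per patch.

    It copies the input and adds a large artifact at a fixed grid position,
    mimicking an artifact token that ignores the local content.
    """
    feats = [row[:] for row in image]
    feats[0][0] += 10.0
    return feats


def denoised_target(image, teacher):
    """Average teacher features over augmented views.

    Each view is mapped back to the original frame before averaging, so
    content-aligned features agree while position-bound artifacts are diluted.
    Every flip is its own inverse, hence the (aug, inverse) pairs repeat.
    """
    views = [(identity, identity), (hflip, hflip),
             (vflip, vflip), (hvflip, hvflip)]
    h, w = len(image), len(image[0])
    acc = [[0.0] * w for _ in range(h)]
    for aug, inv in views:
        feats = inv(teacher(aug(image)))
        for i in range(h):
            for j in range(w):
                acc[i][j] += feats[i][j] / len(views)
    return acc


def add_registers(patch_tokens, num_registers, dim, rng):
    """Append randomly initialized register tokens to the student's sequence."""
    registers = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                 for _ in range(num_registers)]
    return patch_tokens + registers


def drop_registers(tokens, num_registers):
    """Discard register tokens so only patch tokens form the dense output."""
    return tokens[:len(tokens) - num_registers]


def mse(a, b):
    """Mean squared error between two grids: a simple distillation loss."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n
```

On the 2x2 input `[[1.0, 2.0], [3.0, 4.0]]`, a single teacher pass deviates from the clean features by 10 at one patch, while the four-view average deviates by 2.5 at every patch. In this toy setting the artifact is attenuated rather than removed. In an actual training loop, the student's patch outputs (after `drop_registers`) would be regressed onto such a target, with gradients flowing only into the registers and the chosen subset of unlocked weights.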
