Vision Transformers with Self-Distilled Registers

Abstract

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation of this issue is the addition of register tokens to ViTs, which implicitly "absorb" the artifact term during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim to equip them with such register tokens without re-training them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both the teacher and the student network from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach effectively reduces the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
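The denoising step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy teacher, the choice of cyclic shifts and horizontal flips as augmentations, and plain averaging of the re-aligned outputs are all assumptions made for illustration, and every function name is hypothetical. The idea it shows is that an artifact tied to a fixed output position lands on different image locations once each augmentation is undone, so combining the views dilutes it while the content-dependent signal is preserved.

```python
import random


def toy_teacher(image):
    """Hypothetical stand-in for a frozen ViT: one scalar 'embedding' per patch.

    It copies the input and adds a spike at a fixed output position,
    mimicking an artifact token that ignores the local image content.
    """
    features = [row[:] for row in image]
    features[0][0] += 10.0  # artifact tied to the position, not the content
    return features


def shift(grid, dy, dx):
    """Cyclically shift a 2D grid by (dy, dx)."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]


def hflip(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def denoised_teacher_features(image, num_views=32, seed=0):
    """Average teacher outputs over augmented views (assumed aggregation rule)."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for _ in range(num_views):
        dy, dx = rng.randrange(h), rng.randrange(w)
        flip = rng.random() < 0.5
        # Augment the teacher's input.
        view = shift(image, dy, dx)
        if flip:
            view = hflip(view)
        feats = toy_teacher(view)
        # Undo the augmentation in reverse order to re-align the features.
        if flip:
            feats = hflip(feats)
        feats = shift(feats, -dy, -dx)
        for y in range(h):
            for x in range(w):
                total[y][x] += feats[y][x] / num_views
    return total


image = [[float(y * 4 + x) for x in range(4)] for y in range(4)]
raw = toy_teacher(image)
clean = denoised_teacher_features(image)
worst_raw = max(abs(raw[y][x] - image[y][x]) for y in range(4) for x in range(4))
worst_clean = max(abs(clean[y][x] - image[y][x]) for y in range(4) for x in range(4))
print("worst deviation, single view:", worst_raw)
print("worst deviation, averaged views:", worst_clean)
```

In this toy setting the averaged map would serve as the regression target for the student, which carries the extra register tokens. The loss and the choice of which student weights to unlock are not specified in the abstract and are left out of the sketch.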
