Vision Transformers with Self-Distilled Registers

Abstract

Vision Transformers (ViTs) have emerged as the dominant architecture for visual processing tasks, demonstrating excellent scalability with increased training data and model size. However, recent work has identified the emergence of artifact tokens in ViTs that are incongruous with the local semantics. These anomalous tokens degrade ViT performance in tasks that require fine-grained localization or structural coherence. An effective mitigation is to add register tokens to ViTs, which implicitly "absorb" the artifacts during training. Given the availability of various large-scale pre-trained ViTs, in this paper we aim to equip them with such register tokens without retraining them from scratch, which is infeasible considering their size. Specifically, we propose Post Hoc Registers (PH-Reg), an efficient self-distillation method that integrates registers into an existing ViT without requiring additional labeled data or full retraining. PH-Reg initializes both the teacher and student networks from the same pre-trained ViT. The teacher remains frozen and unmodified, while the student is augmented with randomly initialized register tokens. By applying test-time augmentation to the teacher's inputs, we generate denoised dense embeddings free of artifacts, which are then used to optimize only a small subset of unlocked student weights. We show that our approach effectively reduces the number of artifact tokens, improving the segmentation and depth prediction of the student ViT under zero-shot and linear probing.
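The test-time augmentation step can be illustrated with a toy sketch. This is not the PH-Reg implementation: the abstract does not say which augmentations are used or how the views are combined, so the flips and the plain averaging below are assumptions, and `toy_teacher` is a hypothetical stand-in for a frozen ViT that puts an artifact at a fixed grid position. The sketch only shows why undoing each augmentation before averaging dilutes a position-dependent artifact.

```python
# Toy illustration of test-time augmentation (TTA) on dense features.
# Hypothetical throughout: toy_teacher is not a real ViT, and the choice
# of flips plus averaging is an assumption, not the paper's recipe.

def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]

def vflip(grid):
    """Mirror a 2D grid top to bottom."""
    return grid[::-1]

def hvflip(grid):
    """Apply both flips (a 180 degree rotation)."""
    return hflip(vflip(grid))

def identity(grid):
    return grid

def toy_teacher(image):
    """Stand-in for a frozen teacher: one scalar feature per patch.

    The feature equals the patch value, except for a large artifact that
    always appears at grid position (0, 0), whatever the content is.
    """
    feats = [row[:] for row in image]  # copy so the input is not mutated
    feats[0][0] += 10.0
    return feats

# Each flip is its own inverse, so the same function undoes it.
AUGMENTATIONS = [identity, hflip, vflip, hvflip]

def tta_denoise(image):
    """Average teacher features over all augmented views.

    Each view's features are mapped back to the original patch grid
    before averaging, so content aligns while the artifact does not.
    """
    views = [aug(toy_teacher(aug(image))) for aug in AUGMENTATIONS]
    rows, cols = len(image), len(image[0])
    return [
        [sum(v[r][c] for v in views) / len(views) for c in range(cols)]
        for r in range(rows)
    ]

def max_abs_error(feats, reference):
    """Largest per-patch deviation from the artifact-free reference."""
    return max(
        abs(f - x)
        for f_row, x_row in zip(feats, reference)
        for f, x in zip(f_row, x_row)
    )

image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
raw = toy_teacher(image)
denoised = tta_denoise(image)
```

In this toy setting the artifact is spread across the four corner patches at a quarter of its original magnitude rather than removed outright; more views would dilute it further. The distillation step, in which the denoised features supervise a student with register tokens, is not shown here.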
