Vision Transformers Need Registers

Abstract

Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens that appear during inference, primarily in low-informative background areas of images, and that are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.
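The idea of providing additional tokens to the input sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_registers`, `forward_with_registers`), the random initialization, and the choice to discard the extra positions at the output are assumptions made for illustration, and `encoder` stands in for any Transformer stack that maps a token sequence to a sequence of the same length.

```python
import random


def make_registers(num_registers, dim, seed=0):
    """Create extra tokens (hypothetical helper).

    Values are random here; in a trained model they would presumably be
    learned parameters, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_registers)]


def forward_with_registers(tokens, registers, encoder):
    """Run `encoder` on the input tokens extended with the extra tokens.

    tokens:    list of token vectors (e.g. patch embeddings)
    registers: list of extra token vectors appended to the sequence
    encoder:   callable mapping a sequence to a sequence of equal length
    """
    extended = tokens + registers   # extra tokens join the input sequence
    outputs = encoder(extended)     # every token can attend to the extras
    return outputs[:len(tokens)]    # assumption: extra outputs are dropped


if __name__ == "__main__":
    patch_tokens = [[1.0] * 8 for _ in range(5)]   # 5 tokens, dimension 8
    registers = make_registers(num_registers=4, dim=8)
    identity_encoder = lambda seq: seq             # placeholder encoder
    out = forward_with_registers(patch_tokens, registers, identity_encoder)
    print(len(out))                                # same length as the input
```

In this sketch the encoder sees nine tokens while the caller still receives five, so downstream code that consumes the feature map is unaffected by the number of extra tokens.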
