Denoising Vision Transformers

Abstract

We study a subtle but significant problem inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigation traces this issue to the positional embeddings at the input stage. To address it, we propose a novel noise model that is universally applicable to all ViTs. Specifically, the noise model decomposes ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. This decomposition is achieved by enforcing cross-view feature consistency with neural fields on a per-image basis. The per-image optimization extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. To extend our solution to online use, we introduce a learnable denoiser that predicts artifact-free features directly from unprocessed ViT outputs and generalizes remarkably well to novel data without per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that DVT consistently and significantly improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
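To make the idea of the decomposition concrete, the following is a minimal toy sketch and not the DVT implementation. It assumes a simplified, purely additive model in one dimension, y[o, p] = s[o + p] + g[p], where s is a per-content-location semantic feature and g is an artifact that depends only on the patch position. The second, input-dependent artifact term of the noise model is omitted, and a linear least-squares fit stands in for the neural-field optimization. All names (`decompose`, `views`, `offsets`, `s`, `g`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def decompose(views, offsets, num_content):
    """Fit views[v, p] ~= s[offsets[v] + p] + g[p] by linear least squares.

    views:   array of shape (num_views, P, D), features of each 1-D crop
    offsets: start position of each crop within the underlying content
    Returns (s, g) with s of shape (num_content, D) and g of shape (P, D).
    g is shifted to zero mean over positions to remove the constant
    ambiguity between the two terms (an assumption of this sketch).
    """
    num_views, P, D = views.shape
    rows = num_views * P
    A = np.zeros((rows, num_content + P))
    for v, o in enumerate(offsets):
        for p in range(P):
            r = v * P + p
            A[r, o + p] = 1.0            # selects the semantic feature s[o + p]
            A[r, num_content + p] = 1.0  # selects the positional artifact g[p]
    sol, *_ = np.linalg.lstsq(A, views.reshape(rows, D), rcond=None)
    s, g = sol[:num_content], sol[num_content:]
    shift = g.mean(axis=0, keepdims=True)
    return s + shift, g - shift


# Synthetic data: the same content seen through five shifted crops.
rng = np.random.default_rng(0)
P, D = 8, 4
offsets = [0, 1, 2, 3, 4]
num_content = max(offsets) + P
s_true = rng.normal(size=(num_content, D))
g_true = rng.normal(size=(P, D))
g_true -= g_true.mean(axis=0, keepdims=True)  # match the zero-mean convention
views = np.stack([s_true[o:o + P] + g_true for o in offsets])

s_hat, g_hat = decompose(views, offsets, num_content)
print("max error in s:", np.abs(s_hat - s_true).max())
print("max error in g:", np.abs(g_hat - g_true).max())
```

In this toy setting the raw features of the same content location differ between crops because the positional artifact moves with the patch grid, and requiring one shared s across views is what separates the two terms. The sketch only illustrates that cross-view consistency makes the separation identifiable. It says nothing about how the method parameterizes or optimizes the terms in practice.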