Denoising Vision Transformers

January 5, 2024
Authors: Jiawei Yang, Katie Z Luo, Jiefeng Li, Kilian Q Weinberger, Yonglong Tian, Yue Wang
cs.AI

Abstract

We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts, which degrade the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. To extend our solution to online settings, we introduce a learnable denoiser that predicts artifact-free features directly from unprocessed ViT outputs and shows remarkable generalization to novel data without per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
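
As a rough sketch of the decomposition described above, with the symbols f, g, and h introduced here purely for illustration rather than taken from the paper, the noise model can be read as expressing the ViT feature of image x at pixel location p as

    \mathrm{ViT}(x)[p] \;\approx\; \underbrace{f(x)[p]}_{\text{semantics}} \;+\; \underbrace{g(p)}_{\text{artifact}} \;+\; \underbrace{h(x,\,p)}_{\text{artifact}}

where f(x)[p] is the artifact-free semantics term and g, h are the two artifact-related terms conditioned on the pixel location p (how exactly each artifact term depends on the input image is defined in the paper). The per-image stage then fits this decomposition with neural fields so that the semantics term stays consistent across views of the same image, leaving the position-locked, grid-like noise to the artifact terms.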