Denoising Vision Transformers
January 5, 2024
Authors: Jiawei Yang, Katie Z Luo, Jiefeng Li, Kilian Q Weinberger, Yonglong Tian, Yue Wang
cs.AI
Abstract
We delve into a nuanced but significant challenge inherent to Vision
Transformers (ViTs): feature maps of these models exhibit grid-like artifacts,
which hurt the performance of ViTs in downstream tasks. Our
investigations trace this fundamental issue down to the positional embeddings
at the input stage. To address this, we propose a novel noise model, which is
universally applicable to all ViTs. Specifically, the noise model dissects ViT
outputs into three components: a semantics term free from noise artifacts and
two artifact-related terms that are conditioned on pixel locations. Such a
decomposition is achieved by enforcing cross-view feature consistency with
neural fields on a per-image basis. This per-image optimization process
extracts artifact-free features from raw ViT outputs, providing clean features
for offline applications. Expanding the scope of our solution to support online
functionality, we introduce a learnable denoiser to predict artifact-free
features directly from unprocessed ViT outputs, which shows remarkable
generalization capabilities to novel data without the need for per-image
optimization. Our two-stage approach, termed Denoising Vision Transformers
(DVT), does not require re-training existing pre-trained ViTs and is
immediately applicable to any Transformer-based architecture. We evaluate our
method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP,
DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT
consistently and significantly improves existing state-of-the-art
general-purpose models in semantic and geometric tasks across multiple datasets
(e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT
design, especially regarding the naive use of positional embeddings.
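A minimal sketch of the noise model and the two-stage pipeline described above, in our own notation (the abstract does not fix symbols, so the terms below are illustrative assumptions rather than the paper's exact formulation): a raw ViT feature is modeled as a clean semantics term plus two artifact terms tied to the patch position; the per-image stage fits this decomposition with a neural field by requiring the semantics term to agree across views, and the second stage trains a denoiser to predict the clean term directly.

% Illustrative decomposition (symbols are ours): x is the image content,
% \xi the patch/pixel position, y the raw ViT output feature.
\[
  \mathbf{y} \;=\; \underbrace{f(\mathbf{x})}_{\text{semantics (artifact-free)}}
  \;+\; \underbrace{g(\boldsymbol{\xi})}_{\text{position-dependent artifact}}
  \;+\; \underbrace{h(\mathbf{x}, \boldsymbol{\xi})}_{\text{residual artifact}}
\]
% Per-image stage (offline): cross-view feature consistency with a neural
% field -- for two views t_1, t_2 of the same image, corresponding
% locations p <-> p' should share the same semantics term,
\[
  f\bigl(t_1(\mathbf{x})\bigr)\big|_{p} \;\approx\; f\bigl(t_2(\mathbf{x})\bigr)\big|_{p'},
\]
% while the artifact terms absorb the grid-like pattern tied to position.
% Second stage (online): a learnable denoiser D_\phi maps raw outputs
% directly to the clean term, with no per-image optimization:
\[
  D_{\phi}(\mathbf{y}) \;\approx\; f(\mathbf{x}).
\]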