Denoising Vision Transformers

Abstract

We study a subtle but significant problem inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue to the positional embeddings at the input stage. To address it, we propose a novel noise model that is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. This decomposition is achieved by enforcing cross-view feature consistency with neural fields on a per-image basis. The per-image optimization extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. To extend our solution to online use, we introduce a learnable denoiser that predicts artifact-free features directly from unprocessed ViT outputs and generalizes remarkably well to novel data without per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
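
To make the idea of separating content from position-dependent artifacts concrete, the following is a minimal toy sketch in NumPy, not the DVT implementation. It makes several simplifying assumptions: the features are synthetic rather than real ViT outputs, the two artifact-related terms are collapsed into a single additive term that depends only on the position inside a crop, and the neural-field optimization is replaced by an ordinary linear least-squares solve. All names (`decompose`, `true_content`, `true_artifact`, and so on) are hypothetical and chosen for illustration only.

```python
import numpy as np


def decompose(views, shifts, scene_size):
    """Least-squares split of toy features into a content and a position term.

    Assumed toy model (not the paper's full noise model):
        views[v, i, j] = content[i + dy_v, j + dx_v] + artifact[i, j]

    views:      (V, n, n, C) features for V overlapping crops of one scene
    shifts:     list of V (dy, dx) crop offsets into the scene
    scene_size: side length N of the full scene grid
    Returns (content, artifact) with shapes (N, N, C) and (n, n, C).
    """
    V, n, _, C = views.shape
    N = scene_size
    num_content = N * N
    A = np.zeros((V * n * n, num_content + n * n))
    row = 0
    for dy, dx in shifts:
        for i in range(n):
            for j in range(n):
                A[row, (i + dy) * N + (j + dx)] = 1.0   # content at scene coordinate
                A[row, num_content + i * n + j] = 1.0   # artifact at crop coordinate
                row += 1
    b = views.reshape(V * n * n, C)                     # same (v, i, j) order as rows
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    content = x[:num_content].reshape(N, N, C)
    artifact = x[num_content:].reshape(n, n, C)
    # The split is only defined up to a per-channel constant; fix it by
    # assigning the mean to the content term (an arbitrary convention).
    offset = artifact.mean(axis=(0, 1), keepdims=True)
    return content + offset, artifact - offset


# Synthetic scene: random "semantic" features plus a fixed positional pattern.
rng = np.random.default_rng(0)
N, n, C = 10, 6, 4
true_content = rng.normal(size=(N, N, C))
true_artifact = rng.normal(size=(n, n, C))
true_artifact -= true_artifact.mean(axis=(0, 1), keepdims=True)

# Every crop sees different content but the same position-locked artifact.
shifts = [(dy, dx) for dy in range(N - n + 1) for dx in range(N - n + 1)]
views = np.stack(
    [true_content[dy:dy + n, dx:dx + n] + true_artifact for dy, dx in shifts]
)

content, artifact = decompose(views, shifts, N)

# Compare the first crop before and after subtracting the estimated artifact.
raw_error = np.abs(views[0] - true_content[:n, :n]).mean()
clean_error = np.abs((views[0] - artifact) - true_content[:n, :n]).mean()
print(f"mean abs error, raw: {raw_error:.3f}, denoised: {clean_error:.2e}")
```

The sketch relies on the same cue the abstract describes: content moves with the crop while the artifact stays locked to the position grid, so enough overlapping views make the two separable. In this noise-free linear setting the separation is exact up to a constant offset; real ViT features would not satisfy these assumptions so cleanly.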