Denoising Vision Transformers

Abstract

We study a subtle but significant problem inherent to Vision Transformers (ViTs): their feature maps exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigation traces this issue to the positional embeddings at the input stage. To address it, we propose a novel noise model that applies to all ViTs. Specifically, the noise model decomposes ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. This decomposition is achieved by enforcing cross-view feature consistency with neural fields on a per-image basis. The per-image optimization extracts artifact-free features from raw ViT outputs, providing clean features for offline applications. To extend our solution to online use, we introduce a learnable denoiser that predicts artifact-free features directly from unprocessed ViT outputs. The denoiser generalizes well to novel data without any per-image optimization. Our two-stage approach, termed Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
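To make the idea of cross-view consistency concrete, the following toy sketch separates a content-dependent term from a position-dependent term on synthetic data. It is an illustration under strong assumptions, not the DVT implementation: `toy_vit`, the array sizes, and the purely additive two-term model are hypothetical, the third (interaction) term of the noise model is omitted, and ordinary linear least squares is swapped in for the neural fields used in the paper. The point is only that when the same scene location is observed at different positions across views, the two terms become separable up to a constant offset.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 12, 12, 4   # scene size in patches, number of feature channels
h, w = 6, 6           # size of each view (crop) in patches

semantic = rng.normal(size=(H, W, C))  # term tied to scene content
artifact = rng.normal(size=(h, w, C))  # term tied to position inside a view

def toy_vit(top, left):
    """Hypothetical stand-in for a ViT: content features plus a fixed positional pattern."""
    return semantic[top:top + h, left:left + w] + artifact

# Build one linear equation per observed patch:
#   observed = semantic[scene location] + artifact[position in view]
rows, targets = [], []
for top in range(H - h + 1):
    for left in range(W - w + 1):
        feats = toy_vit(top, left)
        for i in range(h):
            for j in range(w):
                row = np.zeros(H * W + h * w)
                row[(top + i) * W + (left + j)] = 1.0  # semantic unknown
                row[H * W + i * w + j] = 1.0           # artifact unknown
                rows.append(row)
                targets.append(feats[i, j])

A = np.stack(rows)
B = np.stack(targets)
solution, _, rank, _ = np.linalg.lstsq(A, B, rcond=None)
semantic_hat = solution[:H * W].reshape(H, W, C)
artifact_hat = solution[H * W:].reshape(h, w, C)

def center(x):
    """Remove the per-channel mean, since the split is only defined up to a constant."""
    return x - x.mean(axis=(0, 1), keepdims=True)

semantic_error = np.abs(center(semantic_hat) - center(semantic)).max()
artifact_error = np.abs(center(artifact_hat) - center(artifact)).max()
print(f"rank={rank}, unknowns={A.shape[1]}")
print(f"max semantic error: {semantic_error:.2e}")
print(f"max artifact error: {artifact_error:.2e}")
```

The system is rank-deficient by exactly one direction (adding a constant to the semantic term and subtracting it from the artifact term), which is why both estimates are compared after centering. A real ViT does not follow this additive toy model exactly, so the sketch shows why cross-view consistency helps rather than how well it works in practice.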