ViT-Up: Faithful Feature Upsampling for Vision Transformers

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
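
To make the quadratic-cost argument concrete, the following minimal sketch counts patch tokens and pairwise attention scores for a few input resolutions. The patch size of 16 and the helper name `attention_cost` are assumptions chosen for illustration, not details taken from the abstract, and the count ignores any class or register tokens.

```python
def attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (patch tokens, pairwise attention scores per head per layer).

    Assumes a square image and a square, non-overlapping patch grid
    (hypothetical defaults; extra class/register tokens are ignored).
    """
    side = image_size // patch_size      # patches along one image side
    tokens = side * side                 # size of the patch-token grid
    return tokens, tokens * tokens       # global self-attention is all-pairs


if __name__ == "__main__":
    for size in (224, 448, 896):
        tokens, pairs = attention_cost(size)
        print(f"{size}x{size} px: {tokens} tokens, {pairs} attention scores per head")
```

Under these assumptions, doubling the input side length quadruples the token count and multiplies the attention score count by 16, which is why backbones tend to be run on coarse grids and a separate upsampler is used to recover dense features.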