ViT-Up: Faithful Feature Upsampling for Vision Transformers

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
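
To make the quadratic-cost bottleneck concrete, the following minimal sketch counts patch tokens and pairwise attention scores for a square input. The patch size of 16 and the helper name `attention_cost` are illustrative assumptions, not values taken from this work, and extra tokens such as a class token are ignored.

```python
def attention_cost(height, width, patch_size=16):
    """Return (num_tokens, num_pairwise_scores) for one global self-attention layer.

    patch_size=16 is an assumed, commonly used value; class/register tokens are ignored.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    tokens = (height // patch_size) * (width // patch_size)
    # Global self-attention compares every token with every token.
    return tokens, tokens * tokens


for side in (224, 448, 896):
    tokens, pairs = attention_cost(side, side)
    print(f"{side}x{side}: {tokens} tokens, {pairs} attention scores per head")
```

Under these assumptions, doubling the input side length quadruples the token count and multiplies the number of attention scores by 16, which is why backbones are typically run on small token grids and their features upsampled afterwards.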