ViT-Up: Faithful Feature Upsampling for Vision Transformers

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, because global self-attention has quadratic cost, ViTs typically operate on relatively small patch-token grids, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments show that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers on dense prediction and semantic correspondence tasks. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, indicating that ViT-Up scales favorably with backbone capacity.