ViT-Up: Faithful Feature Upsampling for Vision Transformers

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
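The resolution bottleneck mentioned above can be made concrete with a little arithmetic. The sketch below is illustrative only: the helper names `token_grid` and `attention_pairs` are hypothetical, the patch size of 16 is an assumed typical value that the abstract does not state, and class or register tokens are ignored. It counts the patch tokens for a given input size and the number of token-token pairs that one global self-attention map must score.

```python
def token_grid(height, width, patch_size=16):
    """Return (rows, cols) of the patch-token grid for an image.

    patch_size=16 is an assumption (a common ViT setting), not a value
    taken from the abstract.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return height // patch_size, width // patch_size


def attention_pairs(height, width, patch_size=16):
    """Count the token-token pairs scored by one global self-attention map.

    Class and register tokens are ignored to keep the arithmetic simple.
    """
    rows, cols = token_grid(height, width, patch_size)
    n_tokens = rows * cols
    return n_tokens * n_tokens


if __name__ == "__main__":
    for side in (224, 448, 896):
        rows, cols = token_grid(side, side)
        print(side, (rows, cols), attention_pairs(side, side))
```

Under these assumptions, doubling the input side length quadruples the token count and multiplies the attention pairs by 16, which is why backbones tend to run on coarse grids. A 224×224 input yields only a 14×14 grid, so a pixel-level task receives one feature per 16×16 block, and a feature upsampler has to fill in the rest.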