ViT-Up: Faithful Feature Upsampling for Vision Transformers

June 12, 2026
Authors: Krispin Wandel, Jingchuan Wang, Hesheng Wang
cs.AI

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
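
To make the quadratic-cost argument concrete, the following minimal sketch counts patch tokens and pairwise attention scores for a square input. It is illustrative arithmetic only, not part of ViT-Up: the patch size of 16 and the helper name `attention_cost` are assumptions, and class or register tokens are ignored.

```python
def attention_cost(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (patch tokens, pairwise attention scores per head per layer).

    Assumes non-overlapping square patches (patch=16 is an assumed default)
    and ignores any class or register tokens.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = (height // patch) * (width // patch)
    # Global self-attention scores every token against every token.
    return tokens, tokens * tokens


if __name__ == "__main__":
    for side in (224, 448, 896):
        tokens, pairs = attention_cost(side, side)
        print(f"{side}x{side}: {tokens} tokens, {pairs} attention scores")
```

Doubling the image side quadruples the token count and multiplies the number of attention scores by 16, which is why backbones are usually run on coarse grids and dense features are recovered afterwards by an upsampler.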