ViT-Up: Faithful Feature Upsampling for Vision Transformers

Abstract

Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, because global self-attention has quadratic cost, ViTs typically operate on relatively small patch-token grids, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments show that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers on dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
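To make the quadratic-cost bottleneck concrete, the following minimal sketch counts patch tokens and pairwise attention entries for a square input. The helper name `attention_cost`, the patch size of 16, and the input resolutions are illustrative assumptions and are not taken from the paper; class and register tokens are ignored for simplicity.

```python
def attention_cost(image_size: int, patch_size: int = 16) -> tuple[int, int]:
    """Return (patch tokens, attention-matrix entries) for a square image.

    Hypothetical helper for illustration only. patch_size=16 is an assumed
    value, and extra tokens (class, register) are not counted.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size   # tokens along one side of the grid
    tokens = side * side              # size of the patch-token grid
    return tokens, tokens * tokens    # global self-attention compares every token pair


if __name__ == "__main__":
    for size in (224, 448, 896):
        tokens, entries = attention_cost(size)
        print(f"{size}x{size} image: {tokens} tokens, {entries} attention entries per head")
```

Under these assumptions, doubling the input side length quadruples the token count and multiplies the number of attention entries by 16, which is why backbones are usually run on coarse grids and dense features are recovered afterwards by an upsampler.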