
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

April 18, 2025
作者: Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
cs.AI

Abstract

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
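The core architectural idea in the abstract, cross-attention where queries come from high-resolution pixel coordinates and keys/values come from low-resolution VFM features, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name, projection matrices, and dimensions are all hypothetical, and the real model additionally conditions queries on the high-resolution image.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coordinate_cross_attention(coords, lr_feats, W_q, W_k, W_v):
    """Toy coordinate-based upsampling step (hypothetical names).

    coords:   (H*W, 2)  normalized (y, x) positions of the target grid
    lr_feats: (h*w, c)  flattened low-resolution VFM feature map
    Returns a (H*W, d) high-resolution feature map.
    """
    q = coords @ W_q                                  # queries from coordinates
    k = lr_feats @ W_k                                # keys from low-res features
    v = lr_feats @ W_v                                # values from low-res features
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (H*W, h*w) attention map
    return attn @ v                                   # per-pixel feature mixture

# Toy example: "upsample" a 4x4 feature map to a 16x16 grid.
rng = np.random.default_rng(0)
h = w = 4; H = W = 16; c = 8; d = 8
lr = rng.normal(size=(h * w, c))
ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
coords = np.stack([ys.ravel(), xs.ravel()], axis=-1)  # (256, 2)
hr = coordinate_cross_attention(
    coords, lr,
    rng.normal(size=(2, d)), rng.normal(size=(c, d)), rng.normal(size=(c, d)))
print(hr.shape)  # (256, 8): one d-dim feature per high-res pixel
```

Because queries are functions of continuous coordinates rather than a fixed grid, the same upsampler can be evaluated at any target resolution, which matches the abstract's claim of flexible adaptation to various input and feature resolutions.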

