
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

April 18, 2025
Authors: Haiwen Huang, Anpei Chen, Volodymyr Havrylov, Andreas Geiger, Dan Zhang
cs.AI

Abstract

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
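To make the abstract's first ingredient concrete, below is a minimal PyTorch sketch of a coordinate-based cross-attention upsampler in the spirit described above: per-pixel coordinates, fused with shallow features of the high-resolution image, form the queries, while the low-resolution VFM features supply the keys and values. The module names, dimensions, and the single-layer convolutional image encoder are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CoordCrossAttnUpsampler(nn.Module):
    """Sketch of a coordinate-based cross-attention upsampler (illustrative)."""

    def __init__(self, feat_dim=384, hidden_dim=128, num_heads=4):
        super().__init__()
        # Shallow encoder for the high-resolution RGB input (an assumption;
        # the paper's image encoder may differ).
        self.img_encoder = nn.Conv2d(3, hidden_dim, kernel_size=3, padding=1)
        # Project normalized (y, x) pixel coordinates into the query space.
        self.coord_proj = nn.Linear(2, hidden_dim)
        self.q_proj = nn.Linear(hidden_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, image, lr_feats):
        # image:    (B, 3, H, W) high-resolution input
        # lr_feats: (B, C, h, w) low-resolution VFM features, C == feat_dim
        B, _, H, W = image.shape
        ys = torch.linspace(-1, 1, H, device=image.device)
        xs = torch.linspace(-1, 1, W, device=image.device)
        grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2)
        coord_emb = self.coord_proj(grid).permute(2, 0, 1)   # (hidden, H, W)
        img_emb = self.img_encoder(image)                    # (B, hidden, H, W)
        # Queries: one per output pixel, coordinate- and image-conditioned.
        q = self.q_proj((img_emb + coord_emb).flatten(2).transpose(1, 2))  # (B, HW, C)
        # Keys/values: the flattened low-resolution VFM feature grid.
        kv = lr_feats.flatten(2).transpose(1, 2)             # (B, hw, C)
        out, _ = self.attn(q, kv, kv)                        # (B, HW, C)
        return out.transpose(1, 2).reshape(B, -1, H, W)      # (B, C, H, W)

# Usage with DINOv2-like shapes (illustrative numbers).
upsampler = CoordCrossAttnUpsampler(feat_dim=384)
image = torch.randn(1, 3, 128, 128)
lr_feats = torch.randn(1, 384, 16, 16)
hr_feats = upsampler(image, lr_feats)  # (1, 384, 128, 128)
```

Because the queries are built from coordinates plus image evidence, the same module can be queried at any output resolution, which matches the abstract's claim of flexible input and feature resolutions; in practice the HW queries would be processed in chunks to bound attention memory.

The training objective can be sketched in the same hedged spirit. One plausible reading of "class-agnostic masks" (e.g., from a promptable segmenter such as SAM) is to pool upsampled features within each mask, producing piecewise-constant pseudo-groundtruth that is sharp at mask boundaries, which the paper combines with self-distillation. The helper below, including its last-mask-wins handling of overlaps, is a simplification for illustration only.

```python
import torch

def mask_pooled_features(feats, masks):
    # feats: (C, H, W) upsampled features; masks: (K, H, W) binary masks.
    # Averages features inside each mask and writes the mean back, yielding
    # piecewise-constant targets (overlapping masks: last one wins).
    pooled = feats.clone()
    for m in masks.bool():
        if m.any():
            pooled[:, m] = feats[:, m].mean(dim=1, keepdim=True)
    return pooled
```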
