LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Abstract

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates the high-resolution image and its pixel coordinates with low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-groundtruth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
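
The coordinate-based cross-attention idea can be illustrated with a minimal sketch. This is not the released LoftUp implementation: the function names (`coord_grid`, `upsample`), the single attention head, the absence of a value projection, and the random matrices standing in for learned weights are all assumptions made for illustration. Each high-resolution pixel forms a query from its RGB value and normalized coordinate, each low-resolution feature forms a key from the feature and its coordinate, and the output at every pixel is an attention-weighted mix of the low-resolution features.

```python
import math
import random


def coord_grid(h, w):
    """Normalized (y, x) pixel-center coordinates in [0, 1], row-major."""
    return [((i + 0.5) / h, (j + 0.5) / w) for i in range(h) for j in range(w)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a learned projection matrix."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def upsample(image, lr_feats, hr_size, lr_size, w_q, w_k):
    """Cross-attend from high-res pixels to low-res features.

    image:    H*W RGB triples (row-major), the high-resolution input
    lr_feats: h*w feature vectors of dimension C from the VFM
    Returns H*W feature vectors of dimension C.
    """
    (H, W), (h, w) = hr_size, lr_size
    # Queries: pixel colour concatenated with the pixel's coordinate.
    queries = [matvec(w_q, list(rgb) + list(yx))
               for rgb, yx in zip(image, coord_grid(H, W))]
    # Keys: low-res feature concatenated with the patch's coordinate.
    keys = [matvec(w_k, list(f) + list(yx))
            for f, yx in zip(lr_feats, coord_grid(h, w))]
    d = len(w_q)
    channels = len(lr_feats[0])
    out = []
    for q in queries:
        attn = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                        for k in keys])
        # Values are the raw low-res features (no value projection here).
        out.append([sum(wt * f[c] for wt, f in zip(attn, lr_feats))
                    for c in range(channels)])
    return out


rng = random.Random(0)
H, W, h, w, C, D = 8, 8, 2, 2, 4, 16
image = [[rng.random() for _ in range(3)] for _ in range(H * W)]
lr_feats = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(h * w)]
w_q = random_matrix(D, 3 + 2, rng)   # RGB + (y, x)
w_k = random_matrix(D, C + 2, rng)   # feature + (y, x)
hr_feats = upsample(image, lr_feats, (H, W), (h, w), w_q, w_k)
print(len(hr_feats), len(hr_feats[0]))  # 64 4
```

Because the queries depend only on the pixel grid, the same function can be called with any `hr_size`, which is one plausible reading of how a coordinate-based design adapts to different input and feature resolutions.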
