LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models

Abstract

Vision foundation models (VFMs) such as DINOv2 and CLIP have achieved impressive results on various downstream tasks, but their limited feature resolution hampers performance in applications requiring pixel-level understanding. Feature upsampling offers a promising direction to address this challenge. In this work, we identify two critical factors for enhancing feature upsampling: the upsampler architecture and the training objective. For the upsampler architecture, we introduce a coordinate-based cross-attention transformer that integrates high-resolution images with coordinates and low-resolution VFM features to generate sharp, high-quality features. For the training objective, we propose constructing high-resolution pseudo-ground-truth features by leveraging class-agnostic masks and self-distillation. Our approach effectively captures fine-grained details and adapts flexibly to various input and feature resolutions. Through experiments, we demonstrate that our approach significantly outperforms existing feature upsampling techniques across various downstream tasks. Our code is released at https://github.com/andrehuang/loftup.
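To make the idea of a coordinate-based cross-attention upsampler more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository linked above for that). It assumes a single attention layer in which each high-resolution pixel forms a query from its RGB value and normalized coordinates, and each low-resolution feature forms a key from the feature and its own coordinates. The function names (`coord_grid`, `cross_attention_upsample`), the `d_model` parameter, and the random untrained projection weights are all hypothetical stand-ins for illustration.

```python
import numpy as np


def coord_grid(h, w):
    """Return an (h*w, 2) array of pixel-center (y, x) coordinates in [0, 1]."""
    ys = (np.arange(h) + 0.5) / h
    xs = (np.arange(w) + 0.5) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_upsample(image, lowres_feats, d_model=32, seed=0):
    """Upsample (h, w, C) features to (H, W, C), guided by an (H, W, 3) image.

    Returns the upsampled features and the (H*W, h*w) attention matrix.
    """
    rng = np.random.default_rng(seed)
    H, W, _ = image.shape
    h, w, C = lowres_feats.shape

    # Queries: one token per high-res pixel, built from RGB plus coordinates.
    q_in = np.concatenate([image.reshape(H * W, 3), coord_grid(H, W)], axis=1)
    # Keys: one token per low-res location, built from the feature plus coordinates.
    k_in = np.concatenate([lowres_feats.reshape(h * w, C), coord_grid(h, w)], axis=1)
    # Values: the low-res features themselves (left unprojected for simplicity).
    values = lowres_feats.reshape(h * w, C)

    # Assumption: random, untrained projections stand in for learned weights.
    w_q = rng.standard_normal((q_in.shape[1], d_model)) / np.sqrt(q_in.shape[1])
    w_k = rng.standard_normal((k_in.shape[1], d_model)) / np.sqrt(k_in.shape[1])

    scores = (q_in @ w_q) @ (k_in @ w_k).T / np.sqrt(d_model)
    attn = softmax(scores, axis=1)  # each pixel's weights over low-res tokens
    return (attn @ values).reshape(H, W, C), attn


# Usage: upsample a 2x2 feature map to the resolution of a 28x28 image.
rng = np.random.default_rng(1)
image = rng.random((28, 28, 3))
feats = rng.standard_normal((2, 2, 8))
hires, attn = cross_attention_upsample(image, feats)
print(hires.shape)  # (28, 28, 8)
```

Because the queries are built per pixel rather than per fixed grid cell, the same function accepts any image and feature resolution, which is one plausible reading of the flexibility the abstract describes. In this simplified form each output feature is a convex combination of the low-resolution features; a trained model would learn the projections and typically stack several such layers.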

