JAFAR: Jack up Any Feature at Any Resolution

Abstract

Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. The project page is available at https://jafar-upsampler.github.io.
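As a rough illustration of the mechanism the abstract describes, the sketch below upsamples features by letting each high-resolution query attend over low-resolution keys that have first been modulated with an SFT-style per-channel scale and shift. It is not the authors' implementation. The function names (`sft_modulate`, `attention_upsample`), the toy inputs, and the choice to pass the scale and shift (`gammas`, `betas`) in directly are all assumptions made for illustration; in a learned model these quantities would presumably be predicted by small networks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sft_modulate(keys, gammas, betas):
    """SFT-style modulation: per-position, per-channel scale and shift.

    Assumption: gammas and betas are given; how they are produced is not
    specified here.
    """
    return [
        [g * k + b for k, g, b in zip(key, gamma, beta)]
        for key, gamma, beta in zip(keys, gammas, betas)
    ]

def attention_upsample(queries, keys, values):
    """Each high-resolution query attends over all low-resolution keys.

    queries: one vector per high-resolution position
    keys:    one vector per low-resolution position (same width as queries)
    values:  low-resolution features to be upsampled
    Returns one output vector per query: a softmax-weighted average of values.
    """
    dim = len(keys[0])
    channels = len(values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(channels)
        ])
    return outputs

# Toy example: 2 low-resolution positions upsampled to 4 positions.
low_res_features = [[1.0, 0.0], [0.0, 1.0]]   # values
low_res_keys = [[1.0, 0.0], [0.0, 1.0]]
gammas = [[2.0, 2.0], [2.0, 2.0]]             # hypothetical SFT scales
betas = [[0.0, 0.0], [0.0, 0.0]]              # hypothetical SFT shifts
high_res_queries = [[3.0, 0.0], [2.0, 1.0], [1.0, 2.0], [0.0, 3.0]]

modulated_keys = sft_modulate(low_res_keys, gammas, betas)
upsampled = attention_upsample(high_res_queries, modulated_keys, low_res_features)
```

Because the number of output positions is set only by the number of queries, the same function can produce any target resolution from the same low-resolution features, which mirrors the arbitrary-resolution property claimed above. Each output is a convex combination of the low-resolution features, so the sketch interpolates existing features rather than creating new ones.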
