JAFAR: Jack up Any Feature at Any Resolution
June 10, 2025
Authors: Paul Couairon, Loick Chambon, Louis Serrano, Jean-Emmanuel Haugeard, Matthieu Cord, Nicolas Thome
cs.AI
Abstract
Foundation Vision Encoders have become essential for a wide range of dense
vision tasks. However, their low-resolution spatial feature outputs necessitate
feature upsampling to produce the high-resolution modalities required for
downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible
feature upsampler that enhances the spatial resolution of visual features from
any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs
an attention-based module designed to promote semantic alignment between
high-resolution queries, derived from low-level image features, and
semantically enriched low-resolution keys, using Spatial Feature Transform
(SFT) modulation. Notably, despite the absence of high-resolution supervision,
we demonstrate that learning at low upsampling ratios and resolutions
generalizes remarkably well to significantly higher output scales. Extensive
experiments show that JAFAR effectively recovers fine-grained spatial details
and consistently outperforms existing feature upsampling methods across a
diverse set of downstream tasks. Project page at
https://jafar-upsampler.github.io
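
To make the attention-based upsampling idea concrete, here is a minimal NumPy sketch of cross-attention between high-resolution queries and SFT-modulated low-resolution keys. It is an illustration under stated assumptions, not the authors' implementation: the function names (sft_modulate, upsample_features), the use of single linear projections w_gamma and w_beta for the scale and shift, the choice of the encoder features as attention values, and all shapes are hypothetical. The random arrays stand in for learned low-level image features and encoder outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sft_modulate(keys, feats, w_gamma, w_beta):
    """SFT-style modulation: scale and shift the keys per position.

    Assumption: gamma and beta are plain linear projections of the
    low-resolution encoder features; the abstract does not specify this.
    """
    gamma = feats @ w_gamma  # (n_low, d) per-position scale
    beta = feats @ w_beta    # (n_low, d) per-position shift
    return gamma * keys + beta


def upsample_features(queries, keys, feats, w_gamma, w_beta):
    """Cross-attention from high-res queries to modulated low-res keys.

    queries: (n_high, d) high-resolution queries (from image features)
    keys:    (n_low, d)  low-resolution keys before modulation
    feats:   (n_low, c)  low-resolution encoder features, used here as
                         values (an assumption of this sketch)
    returns: (n_high, c) features at the query resolution
    """
    d = queries.shape[-1]
    mod_keys = sft_modulate(keys, feats, w_gamma, w_beta)
    attn = softmax(queries @ mod_keys.T / np.sqrt(d), axis=-1)
    return attn @ feats


# Toy example: a 4x4 feature map upsampled to a 16x16 grid.
rng = np.random.default_rng(0)
h, w, H, W, d, c = 4, 4, 16, 16, 8, 32
queries = rng.standard_normal((H * W, d))
keys = rng.standard_normal((h * w, d))
feats = rng.standard_normal((h * w, c))
w_gamma = 0.1 * rng.standard_normal((c, d))
w_beta = 0.1 * rng.standard_normal((c, d))

out = upsample_features(queries, keys, feats, w_gamma, w_beta)
print(out.shape)  # (256, 32)
```

Because the output resolution is set only by the number of queries, the same weights can be applied to a 32x32 or larger query grid without any change. In this sketch that property is structural, and it is one way to read the abstract's claim that training at low upsampling ratios can transfer to higher output scales.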