MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
March 26, 2026
Authors: Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee
cs.AI
Abstract
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: different resolutions offer complementary inductive biases, with low-resolution views excelling at global semantic recognition and high-resolution views being essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy for harnessing this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. MuRF's most compelling attribute is its universality: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representations. We validate this empirically by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, with successful generalization to contrastive models such as SigLIP2.
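The core inference loop the abstract describes (run a frozen VFM on the same image at several input resolutions, then fuse the resulting patch features) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `fake_vfm` is a hypothetical stand-in for a real frozen backbone such as DINOv2, and the resolution list, shared feature grid, and simple averaging fusion are illustrative assumptions.

```python
import numpy as np

def fake_vfm(image, patch=14):
    """Hypothetical stand-in for a frozen VFM: one feature vector per
    patch (a real model like DINOv2 would return learned ViT features)."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, C)
    return patches.mean(axis=(1, 3))  # (gh, gw, C) pooled "features"

def nearest_resize(arr, size):
    """Nearest-neighbor resize of an (H, W, C) array to (size, size, C)."""
    H, W = arr.shape[:2]
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return arr[ys][:, xs]

def murf(image, resolutions=(224, 448), target_grid=32):
    """MuRF sketch: extract features at multiple input resolutions with
    the same frozen model, upsample each feature map to a shared grid,
    and fuse by averaging (one plausible fusion choice)."""
    views = []
    for r in resolutions:
        feats = fake_vfm(nearest_resize(image, r))
        views.append(nearest_resize(feats, target_grid))
    return np.mean(views, axis=0)  # unified (target_grid, target_grid, C) map

image = np.random.rand(300, 400, 3)
fused = murf(image)
print(fused.shape)  # (32, 32, 3)
```

The low-resolution pass contributes a coarse, semantically pooled view while the high-resolution pass preserves finer spatial detail; resampling both onto one grid before averaging is what lets a single representation carry both biases, with the backbone itself never updated.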