MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
March 26, 2026
Authors: Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee
cs.AI
Abstract
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: different resolutions offer complementary inductive biases, with low-resolution views excelling at global semantic recognition and high-resolution views being essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy for harnessing this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. MuRF's most compelling attribute is its universality: it is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representations. We validate this empirically by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families: primarily DINOv2, with successful generalization to contrastive models such as SigLIP2.
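The core inference loop the abstract describes (run a frozen VFM on the same image at several input resolutions, then fuse the resulting patch features) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `fake_vfm` is a hypothetical stand-in for a real frozen backbone such as DINOv2, and the resolution list, shared feature grid, and simple averaging fusion are illustrative assumptions.

```python
import numpy as np

def fake_vfm(image, patch=14):
    """Hypothetical stand-in for a frozen VFM: one feature vector per
    patch (a real model like DINOv2 would return learned ViT features)."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    patches = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, C)
    return patches.mean(axis=(1, 3))  # (gh, gw, C) pooled "features"

def nearest_resize(arr, size):
    """Nearest-neighbor resize of an (H, W, C) array to (size, size, C)."""
    H, W = arr.shape[:2]
    ys = np.arange(size) * H // size
    xs = np.arange(size) * W // size
    return arr[ys][:, xs]

def murf(image, resolutions=(224, 448), target_grid=32):
    """MuRF sketch: extract features at multiple input resolutions with
    the same frozen model, upsample each feature map to a shared grid,
    and fuse by averaging (one plausible fusion choice)."""
    views = []
    for r in resolutions:
        feats = fake_vfm(nearest_resize(image, r))
        views.append(nearest_resize(feats, target_grid))
    return np.mean(views, axis=0)  # unified (target_grid, target_grid, C) map

image = np.random.rand(300, 400, 3)
fused = murf(image)
print(fused.shape)  # (32, 32, 3)
```

The low-resolution pass contributes a coarse, semantically pooled view while the high-resolution pass preserves finer spatial detail; resampling both onto one grid before averaging is what lets a single representation carry both biases, with the backbone itself never updated.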