MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute. It is not tied to a specific architecture, serving instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families, primarily DINOv2, while also demonstrating successful generalization to contrastive models such as SigLIP2.
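The multi-resolution idea described in the abstract can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: `toy_frozen_encoder` is a hypothetical stand-in for a frozen VFM, the scale set `(0.5, 1.0, 2.0)` is arbitrary, and channel-wise concatenation on a common grid is only one plausible way to fuse features. The abstract does not specify the fusion operator, so this should not be read as the authors' implementation.

```python
# Illustrative sketch only: the encoder, the scales, and the concatenation-based
# fusion are assumptions, not the method as specified in the paper.

def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbor resize of a 2D grid; cells may hold any value."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def toy_frozen_encoder(image, patch=2):
    """Hypothetical stand-in for a frozen VFM: one feature vector per patch.

    Each patch is mapped to [mean, max - min]; a real backbone would return
    learned embeddings instead.
    """
    h, w = len(image), len(image[0])
    feats = []
    for r in range(0, h - patch + 1, patch):
        row = []
        for c in range(0, w - patch + 1, patch):
            vals = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            row.append([sum(vals) / len(vals), max(vals) - min(vals)])
        feats.append(row)
    return feats


def multi_resolution_fuse(image, scales=(0.5, 1.0, 2.0), patch=2):
    """Encode the image at several scales and concatenate features per cell."""
    h, w = len(image), len(image[0])
    grid_h, grid_w = h // patch, w // patch  # common output grid (assumed)
    fused = [[[] for _ in range(grid_w)] for _ in range(grid_h)]
    for s in scales:
        # 1. Rescale the input image.
        scaled = resize_nearest(image, int(h * s), int(w * s))
        # 2. Run the same frozen encoder on every scale (no training involved).
        feats = toy_frozen_encoder(scaled, patch)
        # 3. Bring each feature map onto the common grid.
        aligned = resize_nearest(feats, grid_h, grid_w)
        # 4. Fuse by concatenating along the channel dimension.
        for r in range(grid_h):
            for c in range(grid_w):
                fused[r][c].extend(aligned[r][c])
    return fused


if __name__ == "__main__":
    image = [[(r * 8 + c) % 7 for c in range(8)] for r in range(8)]
    fused = multi_resolution_fuse(image)
    print(len(fused), len(fused[0]), len(fused[0][0]))
```

In this sketch the low-resolution pass contributes coarse, wide-context features and the high-resolution pass contributes finer ones, and each output cell carries both. Averaging the aligned maps instead of concatenating them would keep the channel count fixed; which choice matches MuRF is not stated in the abstract.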
