MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

Abstract

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: different resolutions offer complementary inductive biases, with low-resolution views excelling at global semantic recognition and high-resolution views being essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting features. The universality of MuRF is its most compelling attribute: it is not tied to a specific architecture and serves instead as a fundamental, training-free enhancement to visual representation. We empirically validate this by applying MuRF to a broad spectrum of critical computer vision tasks across multiple distinct VFM families, primarily DINOv2, and also demonstrate successful generalization to contrastive models such as SigLIP2.
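The inference-time procedure described above (several resized views, one frozen encoder, fused features) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: `toy_encoder` is a hypothetical stand-in for a frozen VFM such as DINOv2, the three view sizes and the 8x8 output grid are arbitrary, and fusion by resizing each feature map to a common grid and concatenating along the channel axis is only one plausible choice, since the abstract does not specify the fusion operator.

```python
import numpy as np


def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array to (out_h, out_w, C)."""
    h, w = x.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return x[rows][:, cols]


def toy_encoder(image, patch=4):
    """Hypothetical stand-in for a frozen VFM: mean-pools each patch
    into one feature vector, giving a (H/patch, W/patch, C) feature map."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.mean(axis=(1, 3))


def multi_resolution_fusion(image, encoder, sizes=(16, 32, 64), grid=8):
    """Encode the image at several resolutions and fuse the feature maps.

    Assumed fusion rule: bring every feature map to a common grid,
    then concatenate along the channel axis.
    """
    feats = []
    for s in sizes:
        view = resize_nearest(image, s, s)   # one view per resolution
        f = encoder(view)                    # encoder weights stay untouched
        feats.append(resize_nearest(f, grid, grid))
    return np.concatenate(feats, axis=-1)


rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))
fused = multi_resolution_fusion(image, toy_encoder)
print(fused.shape)  # (8, 8, 9): three resolutions x three channels each
```

Because the encoder is only called, never updated, the sketch is training-free in the same sense as the abstract describes; swapping `toy_encoder` for a real frozen backbone would change the channel count but not the structure of the loop.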
