OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Abstract

The image matching field has seen a steady stream of novel learnable feature matching techniques, with ever-improving performance on conventional benchmarks. However, our investigation shows that despite these gains, their potential for real-world applications is restricted by their limited ability to generalize to novel image domains. In this paper, we introduce OmniGlue, the first learnable image matcher designed with generalization as a core principle. OmniGlue leverages broad knowledge from a vision foundation model to guide the feature matching process, boosting generalization to domains not seen at training time. Additionally, we propose a novel keypoint position-guided attention mechanism that disentangles spatial and appearance information, leading to enhanced matching descriptors. We perform comprehensive experiments on a suite of 7 datasets with varied image domains, including scene-level, object-centric, and aerial images. OmniGlue's novel components lead to relative gains of 20.9% on unseen domains with respect to a directly comparable reference model, while also outperforming the recent LightGlue method by a relative 9.5%. Code and model can be found at https://hwjiang1510.github.io/OmniGlue.
