MARCO: Navigating the Unseen Space of Semantic Correspondence

Abstract

Recent advances in semantic correspondence rely on dual-encoder architectures that combine DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond the keypoints seen in training, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective, which refines spatial precision, with a self-distillation framework, which expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that grow at fine-grained localization thresholds (+8.9 PCK@0.01). It also shows the strongest generalization to unseen keypoints (+5.1 on SPair-U) and unseen categories (+4.7 on MP-100), while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at https://github.com/visinf/MARCO.
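For readers unfamiliar with the PCK@0.01 figure quoted in the abstract, the following is a minimal sketch of how a Percentage of Correct Keypoints score is commonly computed. It is not taken from the MARCO code base. The function name `pck`, its parameters, and the choice to normalize by the longer side of the object bounding box are assumptions made for illustration; the abstract does not state which normalization is used.

```python
import math


def pck(pred, gt, bbox_hw, alpha=0.01):
    """Fraction of predicted keypoints that land close enough to the ground truth.

    pred, gt : equal-length lists of (x, y) pixel coordinates.
    bbox_hw  : (height, width) of the object bounding box (assumed normalizer).
    alpha    : tolerance as a fraction of the longer bounding-box side.
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")

    # A prediction counts as correct if its Euclidean distance to the
    # ground-truth point is at most alpha * max(height, width).
    threshold = alpha * max(bbox_hw)
    correct = sum(
        math.hypot(px - gx, py - gy) <= threshold
        for (px, py), (gx, gy) in zip(pred, gt)
    )
    return correct / len(pred)


# Toy data (made up for illustration): three keypoints, one of them 10 px off.
pred = [(10.0, 10.0), (50.0, 50.0), (100.0, 100.0)]
gt = [(10.0, 11.0), (50.0, 60.0), (100.0, 100.0)]

print(pck(pred, gt, bbox_hw=(200, 300), alpha=0.01))  # strict: 3 px tolerance
print(pck(pred, gt, bbox_hw=(200, 300), alpha=0.10))  # loose: 30 px tolerance
```

In this toy case the 10 px error is counted as wrong at alpha = 0.01 but as correct at alpha = 0.10, which illustrates why small thresholds are the ones that expose differences in fine-grained localization.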
