
MARCO: Navigating the Unseen Space of Semantic Correspondence

April 20, 2026
Authors: Claudia Cuttano, Gabriele Trivigno, Carlo Masone, Stefan Roth
cs.AI

Abstract

Recent advances in semantic correspondence rely on dual-encoder architectures, combining DINOv2 with diffusion backbones. While accurate, these billion-parameter models generalize poorly beyond training keypoints, revealing a gap between benchmark performance and real-world usability, where queried points rarely match those seen during training. Building upon DINOv2, we introduce MARCO, a unified model for generalizable correspondence, driven by a novel training framework that enhances both fine-grained localization and semantic generalization. By coupling a coarse-to-fine objective that refines spatial precision with a self-distillation framework that expands sparse supervision beyond annotated regions, our approach transforms a handful of keypoints into dense, semantically coherent correspondences. MARCO sets a new state of the art on SPair-71k, AP-10K, and PF-PASCAL, with gains that amplify at fine-grained localization thresholds (+8.9 PCK@0.01), the strongest generalization to unseen keypoints (+5.1 on SPair-U) and categories (+4.7 on MP-100), all while remaining 3x smaller and 10x faster than diffusion-based approaches. Code is available at https://github.com/visinf/MARCO.
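The coarse-to-fine matching idea mentioned in the abstract can be illustrated with a minimal sketch (hypothetical helper names, random NumPy arrays standing in for DINOv2-style features; this is not the paper's actual implementation): first find the best match for a query feature on a low-resolution feature map, then refine the location within a local window of the high-resolution map.

```python
import numpy as np

def best_match(query, feat_map):
    """Index of the cell in feat_map (H, W, C) most cosine-similar to query (C,)."""
    q = query / (np.linalg.norm(query) + 1e-8)
    f = feat_map / (np.linalg.norm(feat_map, axis=-1, keepdims=True) + 1e-8)
    sim = f @ q  # (H, W) cosine similarities
    return np.unravel_index(np.argmax(sim), sim.shape)

def coarse_to_fine_match(query, coarse_map, fine_map, window=2):
    """Match on the coarse map, then refine inside a local window of the fine map."""
    scale = fine_map.shape[0] // coarse_map.shape[0]
    cy, cx = best_match(query, coarse_map)
    # Centre of the winning coarse cell, mapped to fine-map coordinates.
    fy, fx = cy * scale + scale // 2, cx * scale + scale // 2
    y0, y1 = max(fy - window, 0), min(fy + window + 1, fine_map.shape[0])
    x0, x1 = max(fx - window, 0), min(fx + window + 1, fine_map.shape[1])
    ry, rx = best_match(query, fine_map[y0:y1, x0:x1])
    return y0 + ry, x0 + rx
```

Searching the full fine map costs O(H*W) per query; the coarse pass narrows this to a small window, which is one way such two-stage schemes trade a cheap global search for precise localization.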