SOCO: A Benchmark for Semantic Object Correspondence in Vision Foundation Models

Abstract

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level annotations. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that defines a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes language descriptions of its keypoints, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification performance does. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.