SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Abstract

Measuring structured object understanding in vision foundation models remains challenging because evaluation protocols are inconsistent and part-level supervision is limited. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that defines a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. SOCO also includes language descriptions of keypoints, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position; (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence; and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification performance does. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.