SOCO: Benchmarking Semantic Object Correspondence in Vision Foundation Models

Abstract

Measuring structured object understanding in vision foundation models remains challenging due to inconsistent evaluation protocols and limited part-level supervision. Semantic correspondence (SC) evaluates this capability by testing whether object parts can be matched across instances and categories under large variations in appearance, viewpoint, and geometry. To enable systematic SC evaluation, we introduce SOCO, a new benchmark for Semantic Object Correspondence that defines a taxonomy of correspondence types and provides consistent, functionally meaningful keypoint annotations across 100 categories and over 1M correspondence pairs. In addition, SOCO includes keypoint language descriptions, enabling the evaluation of large vision-language models (LVLMs) and their fine-grained part-level understanding. Comprehensive experiments reveal that (i) vision foundation backbones encode strong semantic structure but transfer correspondences poorly across related categories and only partially capture object-part position, (ii) LVLMs are stronger at text-prompted part localization than at visual-reference cross-image matching, exposing a gap between language-grounded localization and fine-grained visual correspondence, and (iii) correspondence performance predicts performance on dense downstream tasks, including segmentation, tracking, 3D pose estimation, and 3D detection, more strongly than ImageNet classification accuracy does. Together, these findings position SOCO as a benchmark for structured, part-level representation quality in vision and multimodal foundation models.