Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

Abstract

Foundation features from self-supervised vision models and text-to-image diffusion models have proven effective for semantic correspondence estimation. However, because these features are learned primarily from 2D image objectives, they lack explicit 3D awareness and often confuse symmetric object sides, repeated parts, and visually similar structures that are distinct in 3D. We introduce a 3D-aware post-training framework that goes beyond existing 2D foundation features by incorporating priors from 3D foundation models. Given an image, our method uses SAM3D to estimate object geometry and pose, and refines the pose through render-and-compare optimization. We then render PartField descriptors from the reconstructed geometry into the image plane using the estimated object pose. The resulting geometry-aware feature maps complement DINO and Stable Diffusion features, while geodesic distances on the reconstructed shapes enable reliable filtering of candidate correspondences. We use the filtered matches as supervision to train a lightweight adapter on top of DINO and Stable Diffusion for semantic correspondence. In contrast to prior post-training approaches that require pose annotations and rely on coarse spherical geometry, our method automatically obtains instance-specific 3D structure and uses it to guide correspondence learning. Experiments show that our approach improves semantic correspondence over prior methods while reducing manual geometric supervision. Code and models are available at https://github.com/GenIntel/3D-SC.
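
The abstract does not spell out the filtering rule, so the following is only a minimal sketch of how geodesic distances on a reconstructed mesh could be used to reject unreliable candidate matches. It assumes that each candidate comes with two vertex indices on the same target mesh: the vertex hit by the appearance-based match and the vertex suggested by a geometry-based reference. It also assumes that geodesic distance is approximated by the shortest path along mesh edges. The names `edge_graph`, `geodesic_from`, `filter_matches`, and `max_geodesic` are hypothetical and are not taken from the released code.

```python
import heapq
import math


def edge_graph(vertices, faces):
    """Build an adjacency map whose weights are Euclidean edge lengths."""
    adj = {i: {} for i in range(len(vertices))}
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            w = math.dist(vertices[u], vertices[v])
            adj[u][v] = w
            adj[v][u] = w
    return adj


def geodesic_from(adj, source):
    """Dijkstra over mesh edges: approximate geodesic distance from source."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def filter_matches(adj, candidates, max_geodesic):
    """Keep indices of candidates whose two vertices are geodesically close.

    candidates: list of (appearance_vertex, reference_vertex) pairs, both
    indexing the same mesh (an assumption of this sketch).
    """
    kept = []
    for idx, (v_app, v_ref) in enumerate(candidates):
        d = geodesic_from(adj, v_app).get(v_ref, math.inf)
        if d <= max_geodesic:
            kept.append(idx)
    return kept


# Toy mesh: a unit square split into two triangles along the 0-2 diagonal.
vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
faces = [(0, 1, 2), (0, 2, 3)]
adj = edge_graph(vertices, faces)

# Vertices 1 and 3 share no edge, so their edge-path distance is 2.0
# and the second candidate is rejected at a threshold of 1.5.
candidates = [(0, 1), (1, 3), (2, 2)]
kept = filter_matches(adj, candidates, max_geodesic=1.5)
print(kept)  # [0, 2]
```

A surface-based distance is used here rather than a feature-space distance because the two sides of a symmetric object can look nearly identical in 2D features while lying far apart on the shape. Edge-path distances overestimate true geodesics on coarse meshes, so any threshold would need to be tuned to mesh resolution.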