Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed to reason about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space: predicting, from a single image, 3D locations that remain consistent across instances within a category. We show that such correspondence can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence. It contains 178k images covering 50 household object categories and 280 unique instances, with 3D keypoints annotated directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emergent 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.
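To make the idea of a shared canonical grounding concrete, the following is a minimal NumPy sketch and not the Morpheus implementation. It assumes that every instance of a category is expressed as a common set of canonical points plus a per-instance deformation, followed by a rigid object pose that places the points in camera space. Under that assumption, two instances correspond wherever they share a canonical point index. All names (`to_camera_space`, `transfer_point`, the toy template, deformations, and poses) are hypothetical; in practice such quantities would be predicted from an image rather than written by hand.

```python
import numpy as np


def rotation_z(angle_rad):
    """Rotation matrix about the z-axis (used only to build toy poses)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def to_camera_space(canonical, deformation, rotation, translation):
    """Deform the shared canonical points, then apply the rigid object pose."""
    instance_shape = canonical + deformation          # (N, 3), object frame
    return instance_shape @ rotation.T + translation  # (N, 3), camera frame


def transfer_point(query, points_a, points_b):
    """Map a camera-space point on instance A to instance B via the shared index."""
    # Nearest canonical point on A; the same index identifies the match on B.
    index = int(np.argmin(np.linalg.norm(points_a - query, axis=1)))
    return index, points_b[index]


# Hypothetical category template with four canonical points (assumption).
canonical = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Two toy instances: different deformations and different object poses.
deform_a = np.zeros_like(canonical)
deform_b = np.array([
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2],
])
points_a = to_camera_space(canonical, deform_a, rotation_z(0.0),
                           np.array([0.0, 0.0, 2.0]))
points_b = to_camera_space(canonical, deform_b, rotation_z(np.pi / 2),
                           np.array([1.0, 0.0, 3.0]))

# Transfer a point observed on instance A to its counterpart on instance B.
index, match_b = transfer_point(points_a[1], points_a, points_b)
print(index, match_b)
```

In this sketch no correspondence labels are used anywhere: the match falls out of the shared indexing of the canonical points, which is the intuition behind correspondences emerging implicitly from a shared prior.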