Category-Level 3D Correspondence in Camera Space Based on Morphable Object Priors

Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations still fail to capture the fine-grained semantics needed to reason about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space, that is, predicting from a single image 3D locations that remain consistent across instances within a category, and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence. It contains 178k images spanning 50 household object categories and 280 unique instances, with 3D keypoints annotated directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emergent 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.
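
To make the idea of correspondence through a shared canonical grounding concrete, the following is a minimal toy sketch, not the Morpheus implementation. It assumes a hand-written set of canonical points, per-point offsets as the deformation, and a rigid pose limited to a rotation about the z-axis plus a translation; all names (`CANONICAL`, `deform`, `to_camera`, `instance_in_camera`) and all numbers are hypothetical. Because every instance is generated from the same indexed canonical points, index `i` in one instance corresponds to index `i` in another without any explicit correspondence labels.

```python
import math

# Hypothetical canonical shape: three points shared by all instances of a category.
CANONICAL = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]


def deform(points, offsets):
    """Instance-specific shape: add one 3D offset to each canonical point."""
    return [tuple(p + o for p, o in zip(pt, off)) for pt, off in zip(points, offsets)]


def to_camera(points, yaw, translation):
    """Rigid pose (assumed form): rotate about the z-axis by `yaw`, then translate."""
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]


def instance_in_camera(offsets, yaw, translation):
    """Compose canonical shape, deformation, and pose into camera-space points."""
    return to_camera(deform(CANONICAL, offsets), yaw, translation)


# Two toy instances with different deformations and poses.
inst_a = instance_in_camera(
    [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.0, 0.0, 0.0)], 0.0, (0.0, 0.0, 2.0)
)
inst_b = instance_in_camera(
    [(0.0, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.0, 0.3, 0.0)], math.pi / 2, (1.0, 0.0, 3.0)
)

# Points that share an index share a canonical origin, so they correspond.
correspondences = list(zip(inst_a, inst_b))
```

In this sketch the three factors stay separate: changing `offsets` alters only the shape, and changing `yaw` or `translation` alters only where the object sits in camera space, while the index-based correspondence is unaffected.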