Category-Level 3D Correspondence in Camera Space via Morphable Object Priors

Abstract

Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space, that is, predicting from a single image 3D locations that remain consistent across instances within a category, and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence, with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations placed directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emergent 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.
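To make the idea of a shared canonical grounding concrete, the following is a minimal sketch of how correspondences can fall out of a factorization into canonical shape, deformation, and pose. It is not the Morpheus implementation. The template `CANONICAL`, the function names, the hand-picked offsets, and the restriction to a rotation about the z-axis are all assumptions made for illustration only; in a learned model these quantities would be predicted from the image.

```python
import math

# Hypothetical canonical template: K points shared by every instance of a category.
# The semantic labels in the comments are illustrative, not taken from the dataset.
CANONICAL = [
    (0.0, 0.0, 0.0),  # index 0: base centre
    (0.0, 0.0, 1.0),  # index 1: top centre
    (0.5, 0.0, 0.5),  # index 2: handle
]


def deform(canonical, offsets):
    """Instance shape in object space: each canonical point plus its offset."""
    return [tuple(c + d for c, d in zip(p, o)) for p, o in zip(canonical, offsets)]


def to_camera(points, yaw, translation):
    """Apply a rigid pose: rotate about the z-axis by `yaw`, then translate.

    A full pose would use an arbitrary 3D rotation; a single axis keeps the
    sketch short.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]


def predict_points(offsets, yaw, translation):
    """Camera-space points for one instance; index k always means the same part."""
    return to_camera(deform(CANONICAL, offsets), yaw, translation)


# Two instances with different shapes and poses (values chosen by hand).
instance_a = predict_points(
    offsets=[(0.0, 0.0, 0.0), (0.0, 0.0, 0.2), (0.1, 0.0, 0.0)],
    yaw=0.0,
    translation=(0.0, 0.0, 2.0),
)
instance_b = predict_points(
    offsets=[(0.0, 0.0, 0.0), (0.0, 0.0, -0.3), (-0.1, 0.0, 0.1)],
    yaw=math.pi / 2,
    translation=(1.0, 0.0, 3.0),
)

# Correspondence is given by the shared index, with no correspondence labels used.
for k, (pa, pb) in enumerate(zip(instance_a, instance_b)):
    print(k, pa, pb)
```

Because both instances are generated from the same template, the point at index `k` in `instance_a` corresponds to the point at index `k` in `instance_b`, even though their camera-space coordinates differ. This is the sense in which correspondence can be a by-product of the factorization rather than a directly supervised output.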