MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Abstract

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
