MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Abstract

Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.
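To make "motion parameters" concrete, the sketch below shows one common way an articulation can be described and applied to the points of a movable part: a joint type, an axis, a pivot, and a scalar state. This is only an illustration under assumptions. The abstract does not specify MonoArt's parameterization, and the names `Joint`, `rotate_about_axis`, `translate_along_axis`, and `articulate` are hypothetical helpers, not part of any released code.

```python
import math
from dataclasses import dataclass


@dataclass
class Joint:
    """Hypothetical joint description (an assumption, not MonoArt's format)."""
    kind: str      # "revolute" or "prismatic"
    axis: tuple    # joint axis direction (need not be unit length)
    pivot: tuple   # a point on the axis (used by revolute joints only)
    state: float   # angle in radians (revolute) or distance (prismatic)


def _unit(axis):
    # Normalize the axis so angle and distance keep their intended meaning.
    norm = math.sqrt(sum(a * a for a in axis))
    return [a / norm for a in axis]


def rotate_about_axis(point, pivot, axis, angle):
    """Rotate a point about the line through `pivot` along `axis` (Rodrigues' formula)."""
    k = _unit(axis)
    v = [p - c for p, c in zip(point, pivot)]
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [
        v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
        for i in range(3)
    ]
    return [r + c for r, c in zip(rotated, pivot)]


def translate_along_axis(point, axis, distance):
    """Slide a point by `distance` along the unit direction of `axis`."""
    k = _unit(axis)
    return [p + distance * a for p, a in zip(point, k)]


def articulate(points, joint):
    """Apply a joint's motion to every point of a movable part."""
    if joint.kind == "revolute":
        return [rotate_about_axis(p, joint.pivot, joint.axis, joint.state) for p in points]
    if joint.kind == "prismatic":
        return [translate_along_axis(p, joint.axis, joint.state) for p in points]
    raise ValueError(f"unsupported joint kind: {joint.kind}")


# Usage: a door-like part hinged on a vertical axis, opened by 90 degrees.
door = [(2.0, 0.0, 0.0), (2.0, 0.0, 1.0)]
hinge = Joint(kind="revolute", axis=(0, 0, 1), pivot=(1.0, 0.0, 0.0), state=math.pi / 2)
opened = articulate(door, hinge)
```

Under this parameterization, a reconstruction method has to recover the part geometry and also which points belong to the moving part, where the axis lies, and what kind of motion it permits. That is the joint inference problem the abstract describes.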
