From One to More: Contextual Part Latents for 3D Generation

Abstract

Recent advances in 3D generation have shifted from multi-view 2D rendering approaches to 3D-native latent diffusion frameworks that exploit geometric priors in ground-truth data. Despite this progress, three key limitations persist: (1) single-latent representations fail to capture complex multi-part geometries, causing detail degradation; (2) holistic latent coding neglects the part independence and interrelationships that are critical for compositional design; and (3) global conditioning mechanisms lack fine-grained controllability. Inspired by human 3D design workflows, we propose CoPart, a part-aware diffusion framework that decomposes 3D objects into contextual part latents for coherent multi-part generation. This paradigm offers three advantages: (i) it reduces encoding complexity through part decomposition; (ii) it enables explicit modeling of part relationships; and (iii) it supports part-level conditioning. We further develop a mutual guidance strategy that fine-tunes pre-trained diffusion models for joint part latent denoising, preserving both geometric coherence and the foundation model's priors. To enable large-scale training, we construct Partverse, a new 3D part dataset derived from Objaverse through automated mesh segmentation and human-verified annotations. Extensive experiments demonstrate CoPart's superior capabilities in part-level editing, articulated object generation, and scene composition with unprecedented controllability.