Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Abstract

We present a novel alignment-before-generation approach to the challenging task of generating general 3D shapes from 2D images or text. Directly learning a conditional generative model from images or text to 3D shapes tends to produce results that are inconsistent with the conditions, because 3D shapes have an additional dimension whose distribution differs significantly from that of 2D images and text. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former encodes 3D shapes into a shape latent space aligned to images and text, and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via a transformer-based decoder. The latter learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that the proposed approach generates higher-quality and more diverse 3D shapes that conform better semantically to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.
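The abstract does not state which objective is used to align the three modalities. As a minimal sketch only, assuming a CLIP-style symmetric contrastive loss, the idea of pulling each shape embedding toward the image and text embeddings of the same object could look like the following. The function names, the temperature value, and the choice of loss are all illustrative assumptions, not details taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE-style loss between two batches of paired embeddings.

    ASSUMPTION: a[i] and b[i] describe the same object; every other pairing
    in the batch is treated as a negative. The temperature is illustrative.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean negative log-softmax of the diagonal (matching) entries.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transpose so the loss is computed in both directions (a->b and b->a).
    cols = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))

def alignment_loss(shape_emb, image_emb, text_emb, temperature=0.07):
    """Hypothetical helper: align shapes with both images and text."""
    return (contrastive_loss(shape_emb, image_emb, temperature)
            + contrastive_loss(shape_emb, text_emb, temperature))

# Toy batch of two objects with 2-D embeddings.
shapes = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned = alignment_loss(shapes, shapes, shapes)
misaligned = alignment_loss(shapes, swapped, shapes)
```

In this toy batch, the loss is smaller when each shape embedding matches its paired image embedding than when the image embeddings are swapped, which is the behaviour an aligned latent space is meant to reward.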
