Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Abstract

We present a novel alignment-before-generation approach to the challenging task of generating general 3D shapes from 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing results that are inconsistent with the conditions, because 3D shapes have an additional dimension whose distribution differs significantly from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former encodes 3D shapes into a shape latent space aligned with images and texts, and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via a transformer-based decoder. The latter learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that the proposed approach generates higher-quality and more diverse 3D shapes that better conform semantically to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.
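
The abstract does not say how the shape latent space is aligned with images and texts. One common way to align embeddings across modalities is a CLIP-style symmetric contrastive loss, and the sketch below illustrates only that general idea. It is an assumption for illustration, not the paper's training objective: the function names, the temperature value of 0.07, and the use of plain Python lists are all hypothetical choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(shape_emb, cond_emb, temperature=0.07):
    """Symmetric contrastive loss between paired embeddings.

    shape_emb[i] and cond_emb[i] are assumed to describe the same object
    (a shape embedding and its paired image or text embedding).
    The temperature is a hypothetical default, not a value from the paper.
    """
    s = [l2_normalize(v) for v in shape_emb]
    c = [l2_normalize(v) for v in cond_emb]
    # Cosine similarity of every shape against every condition, scaled.
    logits = [
        [sum(a * b for a, b in zip(si, cj)) / temperature for cj in c]
        for si in s
    ]
    # Transpose to score every condition against every shape.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))
```

Under this sketch, the loss is small when each shape embedding points in the same direction as its own paired image or text embedding and large when the pairs are mismatched, which is the behavior an aligned latent space would need.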
