Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Abstract

We present a novel alignment-before-generation approach to the challenging task of generating general 3D shapes from 2D images or text. Directly learning a conditional generative model from images or text to 3D shapes is prone to producing results that are inconsistent with the conditions, because 3D shapes have an additional dimension whose distribution differs significantly from that of 2D images and text. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former encodes 3D shapes into a shape latent space aligned with images and text, and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via a transformer-based decoder. The latter learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that the proposed approach generates higher-quality and more diverse 3D shapes that conform better semantically to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.
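The abstract does not state which objective aligns the shape latent space with images and text. One common choice for this kind of cross-modal alignment is a symmetric contrastive (InfoNCE-style) loss over paired embeddings, and the sketch below illustrates that idea under this assumption. The function names, the temperature value, and the use of plain Python lists are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes a non-zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th item in one modality is paired with the i-th in the other.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(a, b, temperature=0.07):
    # Symmetric InfoNCE-style loss between two batches of paired embeddings.
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b]
              for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(logits_t))


def alignment_loss(shape_emb, image_emb, text_emb, temperature=0.07):
    # Assumed objective: pull each shape embedding toward its paired image
    # and text embeddings, and push it away from the other items in the batch.
    return (contrastive_loss(shape_emb, image_emb, temperature)
            + contrastive_loss(shape_emb, text_emb, temperature))


if __name__ == "__main__":
    shape = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    image = [row[:] for row in shape]
    text = [row[:] for row in shape]
    print("aligned loss:", alignment_loss(shape, image, text))
    # Rotating the image batch breaks every pairing, so the loss rises.
    rotated = image[1:] + image[:1]
    print("misaligned loss:", alignment_loss(shape, rotated, text))
```

In a real system the embeddings would come from learned encoders and the loss would be minimised by gradient descent. The toy vectors here only show that correctly paired embeddings score a lower loss than mismatched ones.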
