Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation

Abstract

We present a novel alignment-before-generation approach to tackle the challenging task of generating general 3D shapes from 2D images or texts. Directly learning a conditional generative model from images or texts to 3D shapes is prone to producing results inconsistent with the conditions, because 3D shapes have an additional dimension whose distribution differs significantly from that of 2D images and texts. To bridge the domain gap among the three modalities and facilitate multi-modal-conditioned 3D shape generation, we explore representing 3D shapes in a shape-image-text-aligned space. Our framework comprises two models: a Shape-Image-Text-Aligned Variational Auto-Encoder (SITA-VAE) and a conditional Aligned Shape Latent Diffusion Model (ASLDM). The former encodes 3D shapes into a shape latent space aligned to the image and text, and reconstructs the fine-grained 3D neural fields corresponding to given shape embeddings via a transformer-based decoder. The latter learns a probabilistic mapping function from the image or text space to the latent shape space. Our extensive experiments demonstrate that the proposed approach generates higher-quality and more diverse 3D shapes that conform better semantically to the visual or textual conditional inputs, validating the effectiveness of the shape-image-text-aligned space for cross-modality 3D shape generation.
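To make the idea of a shape-image-text-aligned space concrete, the following is a minimal sketch of one common way to align two embedding spaces: a symmetric contrastive loss over paired embeddings. The abstract does not state which alignment objective SITA-VAE uses, so the loss form, the function names (`normalize`, `cross_entropy_diagonal`, `alignment_loss`), the temperature value, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def alignment_loss(shape_embs, cond_embs, temperature=0.07):
    """Hypothetical symmetric contrastive loss between paired embeddings.

    shape_embs[i] and cond_embs[i] are assumed to describe the same object;
    cond_embs stands in for image or text embeddings.
    """
    s = [normalize(v) for v in shape_embs]
    c = [normalize(v) for v in cond_embs]
    # Cosine similarity of every shape with every condition, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(si, cj)) / temperature for cj in c]
              for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the shape-to-condition and condition-to-shape directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy data (illustrative only): two shapes with matched and mismatched conditions.
shapes = [[1.0, 0.0], [0.0, 1.0]]
matched = alignment_loss(shapes, [[0.9, 0.1], [0.1, 0.9]])
swapped = alignment_loss(shapes, [[0.1, 0.9], [0.9, 0.1]])
```

Under these assumptions, correctly paired embeddings give a lower loss than swapped pairs, which is the property an aligned latent space relies on when a conditional generative model later maps image or text embeddings to shape latents.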
