
Collaborative Multi-Modal Coding for High-Quality 3D Generation

August 21, 2025
作者: Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
cs.AI

Abstract
3D content is inherently multi-modal and can be projected into different modalities (e.g., RGB images, RGBD images, and point clouds). Each modality exhibits distinct advantages for 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modal data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of the multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both texture and geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets while using only a small amount of training data. Furthermore, additional experiments on recent RGB-D datasets verify the feasibility of incorporating other multi-modal datasets into 3D generation.
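To make the idea of collaborative multi-modal coding concrete, the sketch below fuses per-modality features (RGB, RGBD, point cloud) into one shared latent code via per-modality projections and a soft gate over modalities. This is not the paper's implementation: all dimensions, the random projections, and the norm-based gating scheme are illustrative assumptions standing in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): per-modality feature
# size and the size of the shared fused "multi-modal code".
FEAT = {"rgb": 64, "rgbd": 80, "pointcloud": 96}
LATENT = 32

# One projection per modality maps its features into the shared space.
# In the real model these would be learned encoders; here they are
# random matrices used only to illustrate the data flow.
proj = {m: rng.standard_normal((d, LATENT)) / np.sqrt(d)
        for m, d in FEAT.items()}

def collaborative_code(features: dict) -> np.ndarray:
    """Fuse modality-specific features into one shared latent code."""
    # Embed each modality into the common latent space.
    embedded = {m: f @ proj[m] for m, f in features.items()}
    # Softmax gate over modalities, computed from each embedding's norm,
    # so stronger modality signals contribute more to the fused code.
    norms = np.array([np.linalg.norm(e) for e in embedded.values()])
    gate = np.exp(norms - norms.max())
    gate /= gate.sum()
    # Weighted sum preserves each modality's contribution in one code.
    return sum(g * e for g, e in zip(gate, embedded.values()))

feats = {m: rng.standard_normal(d) for m, d in FEAT.items()}
code = collaborative_code(feats)
print(code.shape)  # (32,)
```

In the actual model this fused code would then condition the triplane latent diffusion stage; the gating here is just one simple way to "integrate modality-specific features while preserving their unique representational strengths."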