
Collaborative Multi-Modal Coding for High-Quality 3D Generation

August 21, 2025
作者: Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
cs.AI

Abstract
3D content is inherently multi-modal and can be projected into different modalities (e.g., RGB images, RGBD images, and point clouds). Each modality exhibits distinct advantages for 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modal data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of the multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both texture and geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets while using only a small amount of training data. Furthermore, additional experiments on recent RGB-D datasets verify the feasibility of incorporating other multi-modal datasets into 3D generation.
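To make the idea of collaborative multi-modal coding concrete, the sketch below fuses per-modality features (RGB, RGBD, point cloud) into one shared latent code via per-modality projections and a soft gate over modalities. This is not the paper's implementation: all dimensions, the random projections, and the norm-based gating scheme are illustrative assumptions standing in for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): per-modality feature
# size and the size of the shared fused "multi-modal code".
FEAT = {"rgb": 64, "rgbd": 80, "pointcloud": 96}
LATENT = 32

# One projection per modality maps its features into the shared space.
# In the real model these would be learned encoders; here they are
# random matrices used only to illustrate the data flow.
proj = {m: rng.standard_normal((d, LATENT)) / np.sqrt(d)
        for m, d in FEAT.items()}

def collaborative_code(features: dict) -> np.ndarray:
    """Fuse modality-specific features into one shared latent code."""
    # Embed each modality into the common latent space.
    embedded = {m: f @ proj[m] for m, f in features.items()}
    # Softmax gate over modalities, computed from each embedding's norm,
    # so stronger modality signals contribute more to the fused code.
    norms = np.array([np.linalg.norm(e) for e in embedded.values()])
    gate = np.exp(norms - norms.max())
    gate /= gate.sum()
    # Weighted sum preserves each modality's contribution in one code.
    return sum(g * e for g, e in zip(gate, embedded.values()))

feats = {m: rng.standard_normal(d) for m, d in FEAT.items()}
code = collaborative_code(feats)
print(code.shape)  # (32,)
```

In the actual model this fused code would then condition the triplane latent diffusion stage; the gating here is just one simple way to "integrate modality-specific features while preserving their unique representational strengths."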