
Collaborative Multi-Modal Coding for High-Quality 3D Generation

August 21, 2025
Authors: Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
cs.AI

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modal data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of the multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both texture and geometric detail. Extensive experiments on multiple well-known datasets demonstrate that, by effectively leveraging multi-modality, TriMM achieves performance competitive with models trained on large-scale datasets despite using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating additional multi-modal datasets into 3D generation.
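
To make the "collaborative multi-modal coding" idea concrete, below is a minimal PyTorch sketch of one plausible design: modality-specific encoders project RGB, RGBD, and point-cloud tokens into a shared latent space, and a set of learned query tokens attends over all of them to produce a fused multi-modal code. This is an illustrative assumption, not the authors' implementation; all module names, dimensions, and the attention-based fusion scheme are hypothetical, and the abstract does not specify these details.

```python
# Hypothetical sketch of collaborative multi-modal coding as described in the
# abstract. NOT the authors' code: architecture choices here are assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Projects one modality's tokens into a shared latent space."""

    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class CollaborativeMultiModalCoder(nn.Module):
    """Fuses modality-specific tokens (RGB / RGBD / point cloud) into one
    multi-modal code via cross-attention, while each modality keeps its own
    encoder (preserving modality-specific representational strengths)."""

    def __init__(self, dims: dict, latent_dim: int = 512, n_heads: int = 8,
                 n_code_tokens: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ModalityEncoder(d, latent_dim) for name, d in dims.items()}
        )
        self.fuse = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        # Learned query tokens that gather information from all modalities.
        self.query = nn.Parameter(torch.randn(1, n_code_tokens, latent_dim))

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: {"rgb": (B, N_rgb, D_rgb), "rgbd": ..., "points": ...};
        # any subset of modalities may be present.
        tokens = torch.cat(
            [self.encoders[k](v) for k, v in inputs.items()], dim=1
        )  # (B, N_total, latent_dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        code, _ = self.fuse(q, tokens, tokens)  # (B, n_code_tokens, latent_dim)
        return code


if __name__ == "__main__":
    model = CollaborativeMultiModalCoder({"rgb": 768, "rgbd": 768, "points": 384})
    batch = {
        "rgb": torch.randn(2, 196, 768),      # e.g. ViT patch tokens
        "rgbd": torch.randn(2, 196, 768),     # RGBD patch tokens
        "points": torch.randn(2, 1024, 384),  # per-point features
    }
    print(model(batch).shape)  # torch.Size([2, 256, 512])
```

In a pipeline like the one the abstract describes, the fused code tokens would then be reshaped or decoded into a triplane latent that conditions the latent diffusion model generating the final 3D asset; the auxiliary 2D and 3D supervision mentioned in the abstract would be applied as extra losses on renderings and geometry during training of this encoder.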