
Collaborative Multi-Modal Coding for High-Quality 3D Generation

August 21, 2025
Authors: Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu
cs.AI

Abstract

3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modal data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of the multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both texture and geometric detail. Extensive experiments on multiple well-known datasets demonstrate that, by effectively leveraging multi-modality, TriMM achieves performance competitive with models trained on large-scale datasets despite using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating additional multi-modal datasets into 3D generation.
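
To make the "collaborative multi-modal coding" idea concrete, below is a minimal PyTorch sketch of one plausible design: modality-specific encoders project RGB, RGBD, and point-cloud tokens into a shared latent space, and a set of learned query tokens attends over all of them to produce a fused multi-modal code. This is an illustrative assumption, not the authors' implementation; all module names, dimensions, and the attention-based fusion scheme are hypothetical, and the abstract does not specify these details.

```python
# Hypothetical sketch of collaborative multi-modal coding as described in the
# abstract. NOT the authors' code: architecture choices here are assumptions.
import torch
import torch.nn as nn


class ModalityEncoder(nn.Module):
    """Projects one modality's tokens into a shared latent space."""

    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class CollaborativeMultiModalCoder(nn.Module):
    """Fuses modality-specific tokens (RGB / RGBD / point cloud) into one
    multi-modal code via cross-attention, while each modality keeps its own
    encoder (preserving modality-specific representational strengths)."""

    def __init__(self, dims: dict, latent_dim: int = 512, n_heads: int = 8,
                 n_code_tokens: int = 256):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ModalityEncoder(d, latent_dim) for name, d in dims.items()}
        )
        self.fuse = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
        # Learned query tokens that gather information from all modalities.
        self.query = nn.Parameter(torch.randn(1, n_code_tokens, latent_dim))

    def forward(self, inputs: dict) -> torch.Tensor:
        # inputs: {"rgb": (B, N_rgb, D_rgb), "rgbd": ..., "points": ...};
        # any subset of modalities may be present.
        tokens = torch.cat(
            [self.encoders[k](v) for k, v in inputs.items()], dim=1
        )  # (B, N_total, latent_dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        code, _ = self.fuse(q, tokens, tokens)  # (B, n_code_tokens, latent_dim)
        return code


if __name__ == "__main__":
    model = CollaborativeMultiModalCoder({"rgb": 768, "rgbd": 768, "points": 384})
    batch = {
        "rgb": torch.randn(2, 196, 768),      # e.g. ViT patch tokens
        "rgbd": torch.randn(2, 196, 768),     # RGBD patch tokens
        "points": torch.randn(2, 1024, 384),  # per-point features
    }
    print(model(batch).shape)  # torch.Size([2, 256, 512])
```

In a pipeline like the one the abstract describes, the fused code tokens would then be reshaped or decoded into a triplane latent that conditions the latent diffusion model generating the final 3D asset; the auxiliary 2D and 3D supervision mentioned in the abstract would be applied as extra losses on renderings and geometry during training of this encoder.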