

ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding

May 14, 2023
作者: Le Xue, Ning Yu, Shu Zhang, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese
cs.AI

Abstract

Recent advancements in multimodal pre-training methods have shown promising efficacy in 3D representation learning by aligning features across the 3D modality, its 2D counterpart modality, and the corresponding language modality. However, the methods existing multimodal pre-training frameworks use to gather multimodal data for 3D applications lack scalability and comprehensiveness, potentially constraining the full potential of multimodal learning. The main bottleneck lies in the language modality's scalability and comprehensiveness. To address this bottleneck, we introduce ULIP-2, a multimodal pre-training framework that leverages state-of-the-art multimodal large language models (LLMs) pre-trained on extensive knowledge to automatically generate holistic language counterparts for 3D objects. We conduct experiments on two large-scale datasets, Objaverse and ShapeNet55, and release our generated three-modality triplet datasets (3D Point Cloud - Image - Language), named "ULIP-Objaverse Triplets" and "ULIP-ShapeNet Triplets". ULIP-2 requires only the 3D data itself and eliminates the need for any manual annotation effort, demonstrating its scalability; it achieves remarkable improvements on downstream zero-shot classification on ModelNet40 (74% Top-1 accuracy). Moreover, ULIP-2 sets a new record on the real-world ScanObjectNN benchmark (91.5% overall accuracy) while using only 1.4 million parameters (~10x fewer than the current SOTA), signifying a breakthrough in scalable multimodal 3D representation learning without human annotations. The code and datasets are available at https://github.com/salesforce/ULIP.
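
The pipeline described in the abstract replaces manual captioning with automatically generated language: each 3D object is rendered from several viewpoints, a multimodal LLM captions the renders, and the captions become the language side of a (3D point cloud, image, language) triplet. Below is a minimal sketch of what that caption-generation step could look like, assuming pre-rendered views and BLIP-2 via Hugging Face transformers; the paths, model choice, and helper names are illustrative, not the official ULIP-2 code (see the repository linked above for the released implementation).

```python
# Hypothetical sketch of a ULIP-2-style caption-generation step:
# caption every pre-rendered view of a 3D object with a multimodal LLM
# (BLIP-2 here) and assemble a (point cloud, images, language) triplet.
import torch
from pathlib import Path
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def caption_views(view_paths):
    """Generate one caption per rendered view of a single 3D object."""
    captions = []
    for path in view_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        out = model.generate(**inputs, max_new_tokens=40)
        captions.append(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
    return captions

# Views rendered beforehand (e.g., with a headless renderer); paths are illustrative.
views = sorted(Path("renders/object_0001").glob("view_*.png"))
triplet = {
    "point_cloud": "pointclouds/object_0001.npy",  # 3D modality
    "images": [str(v) for v in views],             # 2D modality
    "language": caption_views(views),              # language modality from BLIP-2
}
```

The resulting triplets can then drive the CLIP-style alignment the abstract refers to, where a trainable point-cloud encoder is pulled toward frozen image and text embeddings of the same object. A hedged sketch of such a symmetric InfoNCE objective, with placeholder tensors standing in for the encoder outputs:

```python
# Hypothetical sketch of the tri-modal alignment objective: the 3D encoder's
# features are contrasted against frozen image and text features of the same
# object with a symmetric InfoNCE loss. Encoders are replaced by random tensors.
import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# pc_feat would come from the trainable 3D point-cloud encoder; img_feat / txt_feat
# from a frozen vision-language model (e.g., CLIP) applied to the renders and captions.
B, D = 32, 512
pc_feat, img_feat, txt_feat = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
loss = contrastive_loss(pc_feat, img_feat) + contrastive_loss(pc_feat, txt_feat)
```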