

ULIP-2: Towards Scalable Multimodal Pre-training For 3D Understanding

May 14, 2023
Authors: Le Xue, Ning Yu, Shu Zhang, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese
cs.AI

Abstract

Recent advancements in multimodal pre-training methods have shown promising efficacy in 3D representation learning by aligning features across 3D modality, their 2D counterpart modality, and corresponding language modality. However, the methods used by existing multimodal pre-training frameworks to gather multimodal data for 3D applications lack scalability and comprehensiveness, potentially constraining the full potential of multimodal learning. The main bottleneck lies in the language modality's scalability and comprehensiveness. To address this bottleneck, we introduce ULIP-2, a multimodal pre-training framework that leverages state-of-the-art multimodal large language models (LLMs) pre-trained on extensive knowledge to automatically generate holistic language counterparts for 3D objects. We conduct experiments on two large-scale datasets, Objaverse and ShapeNet55, and release our generated three-modality triplet datasets (3D Point Cloud - Image - Language), named "ULIP-Objaverse Triplets" and "ULIP-ShapeNet Triplets". ULIP-2 requires only 3D data itself and eliminates the need for any manual annotation effort, demonstrating its scalability; and ULIP-2 achieves remarkable improvements on downstream zero-shot classification on ModelNet40 (74% Top1 Accuracy). Moreover, ULIP-2 sets a new record on the real-world ScanObjectNN benchmark (91.5% Overall Accuracy) while utilizing only 1.4 million parameters (~10x fewer than current SOTA), signifying a breakthrough in scalable multimodal 3D representation learning without human annotations. The code and datasets are available at https://github.com/salesforce/ULIP.
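The abstract describes aligning 3D point-cloud features with paired image and language features, where the captions come from a multimodal LLM rather than human annotators. Below is a minimal sketch, not the authors' code, of what such tri-modal contrastive alignment could look like in PyTorch: the point-cloud encoder, the feature tensors, and the temperature value are illustrative assumptions; only the URL above points to the actual implementation.

```python
# Hypothetical sketch of CLIP-style tri-modal alignment as described in the abstract.
# Image and caption features are assumed to be precomputed by a frozen
# vision-language model; only the point-cloud encoder is trained.
import torch
import torch.nn.functional as F


def contrastive_loss(anchor, target, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings of shape (B, D)."""
    anchor = F.normalize(anchor, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = anchor @ target.t() / temperature                      # (B, B) similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)     # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def tri_modal_step(pc_encoder, point_clouds, image_feats, text_feats):
    """One training step: pull 3D features toward the paired image and caption features.

    pc_encoder   -- trainable point-cloud backbone returning (B, D) embeddings
    point_clouds -- (B, N, 3) input points
    image_feats, text_feats -- frozen (B, D) features of rendered views and
                               LLM-generated captions (assumed precomputed)
    """
    pc_feats = pc_encoder(point_clouds)
    return contrastive_loss(pc_feats, image_feats) + contrastive_loss(pc_feats, text_feats)
```

In this reading, scalability comes from the data side: captions are generated automatically for each rendered view of a 3D object, so the triplet construction needs no manual labels, and the alignment objective itself stays a standard contrastive loss.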