OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding
May 18, 2023
Authors: Minghua Liu, Ruoxi Shi, Kaiming Kuang, Yinhao Zhu, Xuanlin Li, Shizhong Han, Hong Cai, Fatih Porikli, Hao Su
cs.AI
Abstract
We introduce OpenShape, a method for learning multi-modal joint
representations of text, image, and point clouds. We adopt the commonly used
multi-modal contrastive learning framework for representation alignment, but
with a specific focus on scaling up 3D representations to enable open-world 3D
shape understanding. To achieve this, we scale up training data by ensembling
multiple 3D datasets and propose several strategies to automatically filter and
enrich noisy text descriptions. We also explore and compare strategies for
scaling 3D backbone networks and introduce a novel hard negative mining module
for more efficient training. We evaluate OpenShape on zero-shot 3D
classification benchmarks and demonstrate its superior capabilities for
open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy
of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than
10% for existing methods. OpenShape also achieves an accuracy of 85.3% on
ModelNet40, outperforming previous zero-shot baseline methods by 20% and
performing on par with some fully-supervised methods. Furthermore, we show that
our learned embeddings encode a wide range of visual and semantic concepts
(e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D
and image-3D interactions. Due to their alignment with CLIP embeddings, our
learned shape representations can also be integrated with off-the-shelf
CLIP-based models for various applications, such as point cloud captioning and
point cloud-conditioned image generation.
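
To make the zero-shot classification setup described above concrete, here is a minimal sketch of how a CLIP-aligned point-cloud embedding could be matched against CLIP text embeddings of category names. This is not the authors' released code; `point_encoder`, `clip_model`, `clip_tokenizer`, and the prompt template are placeholder assumptions standing in for a trained 3D backbone and a frozen CLIP text encoder.

```python
# Hypothetical sketch of zero-shot 3D classification with CLIP-aligned embeddings.
import torch
import torch.nn.functional as F


def zero_shot_classify(point_cloud, category_names, point_encoder, clip_model, clip_tokenizer):
    """Pick the category whose CLIP text embedding is closest to the shape embedding."""
    with torch.no_grad():
        # Encode the point cloud into the shared (CLIP-aligned) embedding space.
        shape_emb = F.normalize(point_encoder(point_cloud), dim=-1)      # (1, d)

        # Encode each candidate category name with the frozen CLIP text encoder.
        prompts = [f"a 3D model of a {name}" for name in category_names]
        text_emb = clip_model.encode_text(clip_tokenizer(prompts))        # (C, d)
        text_emb = F.normalize(text_emb, dim=-1)

    # Cosine similarity between the shape embedding and every category embedding;
    # the highest-scoring category is the zero-shot prediction.
    logits = shape_emb @ text_emb.t()                                      # (1, C)
    return category_names[logits.argmax(dim=-1).item()]
```

Because the shape embeddings live in the same space as CLIP's, the same similarity computation can be reused for the text-3D and image-3D retrieval and the off-the-shelf CLIP-based applications mentioned in the abstract.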