ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Abstract

The performance of neural networks scales with both their size and the amount of data they have been trained on. This has been shown in both language and image generation. However, such scaling requires scaling-friendly network architectures as well as large-scale datasets. Even though scaling-friendly architectures like transformers have emerged for 3D vision tasks, the GPT moment of 3D vision remains distant due to the lack of training data. In this paper, we introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations. Specifically, we complement the ARKitScenes dataset with dense semantic annotations that are automatically generated at scale. To this end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the needs of large-scale pre-training. This involves extending the pipeline with cutting-edge segmentation models as well as making it robust to the challenges of large-scale processing. Further, we push forward the state-of-the-art performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic segmentation models, demonstrating the efficacy of our generated dataset.
