ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
October 17, 2024
Authors: Guangda Ji, Silvan Weder, Francis Engelmann, Marc Pollefeys, Hermann Blum
cs.AI
Abstract
The performance of neural networks scales with both their size and the amount
of data they have been trained on. This is shown in both language and image
generation. However, this requires scaling-friendly network architectures as
well as large-scale datasets. Even though scaling-friendly architectures like
transformers have emerged for 3D vision tasks, the GPT-moment of 3D vision
remains distant due to the lack of training data. In this paper, we introduce
ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense
semantic annotations. Specifically, we complement the ARKitScenes dataset with
dense semantic annotations that are automatically generated at scale. To this
end, we extend LabelMaker, a recent automatic annotation pipeline, to serve the
needs of large-scale pre-training. This involves extending the pipeline with
cutting-edge segmentation models as well as making it robust to the challenges
of large-scale processing. Further, we push forward the state-of-the-art
performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic
segmentation models, demonstrating the efficacy of our generated dataset.