

NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

April 1, 2024
Authors: Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus
cs.AI

Abstract

Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world, such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as point clouds, where the information density can be uneven and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on the Front3D and ScanNet datasets, with absolute performance improvements of over 20% AP50 and 8% AP25 for 3D object detection.
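
To make the pretraining recipe concrete, below is a minimal PyTorch sketch of the masked-volume autoencoding idea the abstract describes: a NeRF-sampled RGB+density grid is split into non-overlapping 3D patches, a random subset is masked, and a transformer reconstructs the masked voxels. Everything here is illustrative: the class name `VoxelMAE`, the grid and patch sizes, the 75% mask ratio, and the plain Transformer encoder (standing in for the paper's 3D Swin Transformer) are assumptions for the sketch, not the authors' released implementation.

```python
# Minimal sketch of masked autoencoding over a NeRF-derived RGB+density voxel
# grid. All hyperparameters and names are illustrative assumptions; the paper
# uses a 3D Swin Transformer, approximated here by a plain Transformer encoder.
import torch
import torch.nn as nn


class VoxelMAE(nn.Module):
    """Masked autoencoder over a (RGB + density) voxel grid (sketch)."""

    def __init__(self, grid=64, patch=16, in_ch=4, dim=384, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.n_patches = (grid // patch) ** 3
        # Patchify: a strided 3D conv turns the grid into patch tokens
        # (the volumetric analogue of ViT patch embedding).
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Stand-in encoder; the paper employs a 3D Swin Transformer instead.
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Lightweight head predicting the raw voxels of each patch.
        self.head = nn.Linear(dim, in_ch * patch ** 3)

    def forward(self, vox):                      # vox: (B, 4, G, G, G), RGB + sigma
        tokens = self.embed(vox).flatten(2).transpose(1, 2) + self.pos
        B, N, D = tokens.shape
        n_keep = int(N * (1 - self.mask_ratio))
        # Randomly keep a subset of patch tokens; the rest are masked out.
        perm = torch.rand(B, N, device=vox.device).argsort(dim=1)
        keep = perm[:, :n_keep]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        encoded = self.encoder(visible)
        # Scatter encoded tokens back, filling masked slots with a learned token.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), encoded)
        pred = self.head(full)                   # (B, N, in_ch * patch^3)
        # Reconstruction target: the ground-truth voxels of every patch.
        target = vox.unfold(2, self.patch, self.patch) \
                    .unfold(3, self.patch, self.patch) \
                    .unfold(4, self.patch, self.patch)
        target = target.permute(0, 2, 3, 4, 1, 5, 6, 7).reshape(B, N, -1)
        # As in 2D MAE, the loss is taken over masked patches only.
        masked = torch.ones(B, N, device=vox.device)
        masked.scatter_(1, keep, 0.0)
        loss = ((pred - target) ** 2).mean(-1)
        return (loss * masked).sum() / masked.sum()


# Usage: one pretraining step on a toy batch of NeRF-sampled grids.
model = VoxelMAE(grid=64, patch=16)
grid = torch.rand(2, 4, 64, 64, 64)             # stand-in for sampled NeRF grids
print(model(grid).item())
```

Restricting the loss to masked patches, as in 2D MAE, forces the encoder to infer occluded geometry and appearance from visible context, which is what makes the learned representation useful for downstream 3D tasks such as object detection.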
