NeRF-MAE: Masked AutoEncoders for Self-Supervised 3D Representation Learning for Neural Radiance Fields

April 1, 2024
Authors: Muhammad Zubair Irshad, Sergey Zakharov, Vitor Guizilini, Adrien Gaidon, Zsolt Kira, Rares Ambrus
cs.AI

Abstract

Neural fields excel in computer vision and robotics due to their ability to understand the 3D visual world, such as inferring semantics, geometry, and dynamics. Given the capabilities of neural fields in densely representing a 3D scene from 2D images, we ask the question: can we scale their self-supervised pretraining, specifically using masked autoencoders, to generate effective 3D representations from posed RGB images? Owing to the astounding success of extending transformers to novel data modalities, we employ standard 3D Vision Transformers to suit the unique formulation of NeRFs. We leverage NeRF's volumetric grid as a dense input to the transformer, contrasting it with other 3D representations such as point clouds, where the information density can be uneven and the representation is irregular. Due to the difficulty of applying masked autoencoders to an implicit representation such as NeRF, we opt for extracting an explicit representation that canonicalizes scenes across domains by employing the camera trajectory for sampling. Our goal is made possible by masking random patches from NeRF's radiance and density grid and employing a standard 3D Swin Transformer to reconstruct the masked patches. In doing so, the model can learn the semantic and spatial structure of complete scenes. We pretrain this representation at scale on our proposed curated posed-RGB data, totaling over 1.8 million images. Once pretrained, the encoder is used for effective 3D transfer learning. Our novel self-supervised pretraining for NeRFs, NeRF-MAE, scales remarkably well and improves performance on various challenging 3D tasks. Utilizing unlabeled posed 2D data for pretraining, NeRF-MAE significantly outperforms self-supervised 3D pretraining and NeRF scene understanding baselines on the Front3D and ScanNet datasets, with an absolute performance improvement of over 20% AP50 and 8% AP25 for 3D object detection.
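The core pretraining objective described above can be illustrated with a minimal sketch, which is not the authors' implementation. Assumptions made here for illustration: the explicit NeRF representation is a 4-channel voxel grid (RGB radiance plus density) of size 64^3, patches are 16^3, the masking ratio is 75%, and plain linear layers stand in for the paper's 3D Swin Transformer encoder and decoder; the helpers `patchify` and `random_mask` are hypothetical names.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of mask-and-reconstruct pretraining on an explicit NeRF grid.
# NOT the authors' code: grid size, patch size, and the linear encoder/decoder
# are placeholder assumptions standing in for the 3D Swin Transformer.

def patchify(grid: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, D, H, W) grid into flattened, non-overlapping 3D patches."""
    C, D, H, W = grid.shape
    g = grid.reshape(C, D // patch, patch, H // patch, patch, W // patch, patch)
    g = g.permute(1, 3, 5, 0, 2, 4, 6)              # (d, h, w, C, p, p, p)
    return g.reshape(-1, C * patch ** 3)            # (num_patches, patch_dim)

def random_mask(num_patches: int, mask_ratio: float = 0.75):
    """Randomly choose which patch indices are masked vs. kept visible."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    return perm[:num_masked], perm[num_masked:]

# Toy grid standing in for a scene's sampled radiance-and-density volume.
grid = torch.rand(4, 64, 64, 64)
patches = patchify(grid)                            # (64, 16384)
masked_ids, visible_ids = random_mask(patches.shape[0])

# The encoder sees only visible patches; the decoder reconstructs the masked
# ones from that context (here, crudely, from a pooled latent).
encoder = torch.nn.Linear(patches.shape[1], 256)
decoder = torch.nn.Linear(256, patches.shape[1])

context = encoder(patches[visible_ids]).mean(dim=0, keepdim=True)
pred = decoder(context).expand(len(masked_ids), -1)

# Reconstruction loss is computed on the masked patches only, as in MAE.
loss = F.mse_loss(pred, patches[masked_ids])
loss.backward()
print(f"masked-patch reconstruction loss: {loss.item():.4f}")
```

In the actual method, the explicit grid is obtained by querying a trained NeRF along its camera trajectory, and reconstruction is performed by a 3D Swin Transformer over patch tokens; the sketch only shows the masked-reconstruction objective that drives the representation learning.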
