EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

November 3, 2023
作者: Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, Yue Wang
cs.AI

Abstract

We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.
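To make the three-field design concrete, the snippet below is a minimal PyTorch sketch of how a static field, a time-conditioned dynamic field, and an induced flow field could be composed, with the flow used to aggregate dynamic features from neighbouring timestamps as the abstract describes. The module names, MLP sizes, blending rule, and aggregation window are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a static/dynamic/flow decomposition (assumed design; tiny
# MLPs stand in for the hash-grid-backed fields a real system would use).
import torch
import torch.nn as nn


class TinyField(nn.Module):
    """Small MLP stand-in for a neural field."""

    def __init__(self, in_dim: int, out_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DecomposedSceneField(nn.Module):
    """Static field s(x), dynamic field d(x, t), and flow field f(x, t).

    No decomposition labels are used: both branches are queried jointly, and
    the dynamic branch is expected to absorb time-varying content through
    self-supervised reconstruction alone.
    """

    def __init__(self, feat_dim: int = 16):
        super().__init__()
        self.static_field = TinyField(3, 1 + feat_dim)   # x -> density + features
        self.dynamic_field = TinyField(4, 1 + feat_dim)  # (x, t) -> density + features
        self.flow_field = TinyField(4, 3)                # (x, t) -> scene flow

    def forward(self, xyz: torch.Tensor, t: torch.Tensor, dt: float = 0.1):
        xyzt = torch.cat([xyz, t], dim=-1)

        s_out = self.static_field(xyz)
        d_out = self.dynamic_field(xyzt)
        sigma_s, feat_s = s_out[..., :1], s_out[..., 1:]
        sigma_d, feat_d = d_out[..., :1], d_out[..., 1:]

        # Flow-based multi-frame aggregation: query the dynamic field at the
        # positions the flow predicts for neighbouring timestamps and average.
        flow = self.flow_field(xyzt)
        warped_next = torch.cat([xyz + flow * dt, t + dt], dim=-1)
        warped_prev = torch.cat([xyz - flow * dt, t - dt], dim=-1)
        feat_d = (feat_d
                  + self.dynamic_field(warped_next)[..., 1:]
                  + self.dynamic_field(warped_prev)[..., 1:]) / 3.0

        # Density-weighted blend of static and dynamic contributions
        # (one plausible compositing rule; an assumption, not the paper's).
        sigma = torch.relu(sigma_s) + torch.relu(sigma_d)
        w_d = torch.relu(sigma_d) / (sigma + 1e-6)
        feat = (1.0 - w_d) * feat_s + w_d * feat_d
        return sigma, feat


if __name__ == "__main__":
    model = DecomposedSceneField()
    xyz = torch.rand(1024, 3)       # sampled ray points
    t = torch.rand(1024, 1)         # normalized timestamps
    sigma, feat = model(xyz, t)
    print(sigma.shape, feat.shape)  # torch.Size([1024, 1]) torch.Size([1024, 16])
```

In this sketch the blended features would feed a volume renderer, and reconstruction losses on rendered images (and, per the abstract, lifted 2D foundation-model features) would drive the static/dynamic split to emerge without object annotations, pre-trained segmentation, or optical-flow supervision.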