EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision
November 3, 2023
Authors: Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, Yue Wang
cs.AI
Abstract
We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components: First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly-dynamic scenes self-sufficiently, without relying on ground truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly-dynamic settings.
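To make the static/dynamic/flow decomposition concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a time-independent static field, a time-conditioned dynamic field, and a flow field whose predicted forward/backward displacements are used to fetch and average dynamic features at neighboring times. The small MLP field architectures, the additive density composition, and the fixed time offset `dt` are simplifying assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of a static + dynamic + flow field decomposition.
# NOT the authors' implementation; architectures and composition rule are assumptions.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=64, depth=3):
    """Small fully-connected network used as a stand-in for the actual scene fields."""
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers += [nn.Linear(d, out_dim)]
    return nn.Sequential(*layers)


class DecomposedSceneField(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        self.static_field = mlp(3, 1 + feat_dim)   # density + feature from (x, y, z)
        self.dynamic_field = mlp(4, 1 + feat_dim)  # density + feature from (x, y, z, t)
        self.flow_field = mlp(4, 6)                # forward/backward displacement from (x, y, z, t)

    def query_dynamic(self, x, t):
        out = self.dynamic_field(torch.cat([x, t], dim=-1))
        return out[..., :1], out[..., 1:]          # (sigma_d, dynamic feature)

    def forward(self, x, t, dt=0.1):
        # Static branch: time-independent density and feature.
        s = self.static_field(x)
        sigma_s, g_s = s[..., :1], s[..., 1:]

        # Dynamic branch at the query time.
        sigma_d, g_d = self.query_dynamic(x, t)

        # Flow branch: predicted displacements aggregate dynamic features
        # from neighboring times (simple 3-frame average here).
        flow = self.flow_field(torch.cat([x, t], dim=-1))
        fwd, bwd = flow[..., :3], flow[..., 3:]
        _, g_fwd = self.query_dynamic(x + fwd, t + dt)
        _, g_bwd = self.query_dynamic(x + bwd, t - dt)
        g_d = (g_d + g_fwd + g_bwd) / 3.0

        # Compose static and dynamic contributions (an assumed additive rule).
        sigma = torch.relu(sigma_s) + torch.relu(sigma_d)
        return sigma, g_s, g_d


# Toy usage: 1024 sample points at a single normalized timestamp.
model = DecomposedSceneField()
x = torch.rand(1024, 3)
t = torch.full((1024, 1), 0.5)
sigma, g_s, g_d = model(x, t)
print(sigma.shape, g_s.shape, g_d.shape)  # (1024, 1), (1024, 32), (1024, 32)
```

Because no labels or pre-trained segmentation/flow models appear anywhere in this setup, any separation between the static and dynamic branches has to emerge from the reconstruction objective itself, which is the self-supervised decomposition the abstract refers to.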
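The abstract also mentions lifting 2D visual foundation model features into 4D space-time and correcting a positional bias in modern Transformer features. The sketch below shows one way such lifting can be supervised: rendered per-pixel scene features are mapped into the teacher's feature space and regressed against frozen 2D features, with a single learnable image-plane pattern added to absorb position-dependent artifacts. The shared `pe_pattern` parameter, the feature dimensions, and the plain L2 objective are assumptions for illustration, not the authors' exact formulation.

```python
# Minimal sketch of supervising a lifted feature field with frozen 2D features,
# while factoring out a camera-space positional pattern. Dimensions and the
# learnable pattern are illustrative assumptions.
import torch
import torch.nn as nn


class LiftedFeatureHead(nn.Module):
    def __init__(self, feat_dim=64, h=40, w=60, teacher_dim=384):
        super().__init__()
        # Maps a rendered per-pixel scene descriptor into the teacher feature space.
        self.to_teacher = nn.Linear(feat_dim, teacher_dim)
        # One learnable image-plane pattern, shared across all frames, intended
        # to absorb position-dependent artifacts in the 2D teacher features.
        self.pe_pattern = nn.Parameter(torch.zeros(h, w, teacher_dim))

    def forward(self, rendered_feats):  # (H, W, feat_dim) from volume rendering
        return self.to_teacher(rendered_feats) + self.pe_pattern


# Toy training step against one teacher feature map.
head = LiftedFeatureHead()
rendered = torch.rand(40, 60, 64, requires_grad=True)  # stand-in for rendered features
teacher_feats = torch.rand(40, 60, 384)                # stand-in for frozen 2D foundation model features
loss = ((head(rendered) - teacher_feats) ** 2).mean()
loss.backward()
print(float(loss))
```

At inference time the scene-grounded features can be queried without the image-plane pattern, which is one plausible reading of how removing this positional component benefits downstream 3D perception tasks such as occupancy prediction.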