EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision

Abstract

We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges on two core components. First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, improving the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly dynamic scenes self-sufficiently, without relying on ground-truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly dynamic settings.
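To help read the reported PSNR gains, here is a minimal sketch of the standard peak signal-to-noise ratio definition. It assumes images scaled to [0, 1] and gains expressed in decibels. The helper names `psnr` and `mse_ratio_for_gain` are illustrative and do not come from the EmerNeRF codebase.

```python
import numpy as np


def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape.

    Assumes pixel values lie in [0, max_value] (an assumption of this sketch).
    """
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, and for a single image, a gain of 2.93 dB corresponds to roughly half the mean squared error (a factor of about 0.51), and a gain of 3.70 dB to a factor of about 0.43. Gains averaged over many scenes do not map exactly onto a single MSE ratio, so these figures are only a rough guide.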
