CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

Abstract

We propose CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting, a method for predicting future 3D scenes given past observations, such as 2D ego-centric images. Our method uses a probabilistic encoder to map an image to a distribution over plausible 3D latent scene configurations, and predicts how the hypothesized scenes evolve through time. Our latent scene representation conditions a global Neural Radiance Field (NeRF) to represent a 3D scene model, which enables explainable predictions and straightforward downstream applications. This approach extends previous neural rendering work by accounting for uncertainty in environmental states and dynamics. We train a Pose-Conditional-VAE and a NeRF in two stages to learn 3D representations. Additionally, we auto-regressively predict latent scene representations as a partially observable Markov decision process, using a mixture density network. We demonstrate the utility of our method in realistic scenarios using the CARLA driving simulator, where CARFF can be used to enable efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving visual occlusions.
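
The auto-regressive latent prediction described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_mdn` is a hypothetical stand-in for a trained mixture density network, and the two-component mixture, the latent dimension, and the function names `sample_mixture` and `rollout` are all assumptions made for illustration. The sketch only shows the general pattern: at each step a network outputs a Gaussian mixture over the next latent, one sample is drawn, and that sample is fed back as the next input.

```python
import random


def sample_mixture(weights, means, stds, rng):
    """Draw one latent vector from a diagonal Gaussian mixture."""
    # Pick a mixture component according to the mixture weights.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample each latent dimension independently from the chosen component.
    return [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]


def toy_mdn(z):
    """Hypothetical stand-in for a trained mixture density network.

    Returns (weights, means, stds) describing two equally likely futures:
    every latent dimension shifts by +1.0 or by -1.0.
    """
    weights = [0.5, 0.5]
    means = [[x + 1.0 for x in z], [x - 1.0 for x in z]]
    stds = [[0.1] * len(z), [0.1] * len(z)]
    return weights, means, stds


def rollout(z0, steps, mdn, seed=0):
    """Auto-regressively sample a latent trajectory of length steps + 1."""
    rng = random.Random(seed)
    trajectory = [list(z0)]
    for _ in range(steps):
        # Condition the next-step mixture on the most recent sampled latent.
        weights, means, stds = mdn(trajectory[-1])
        trajectory.append(sample_mixture(weights, means, stds, rng))
    return trajectory


if __name__ == "__main__":
    for z in rollout([0.0, 0.0], steps=3, mdn=toy_mdn):
        print([round(x, 2) for x in z])
```

Sampling several rollouts with different seeds yields distinct hypothesized futures. In a pipeline like the one the abstract describes, each sampled latent would presumably condition the NeRF to render the corresponding scene.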