CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting

Abstract

We propose CARFF (Conditional Auto-encoded Radiance Field for 3D Scene Forecasting), a method for predicting future 3D scenes from past observations such as 2D ego-centric images. Our method uses a probabilistic encoder to map an image to a distribution over plausible 3D latent scene configurations, and predicts how the hypothesized scenes evolve through time. The latent scene representation conditions a global Neural Radiance Field (NeRF) that serves as the 3D scene model, which enables explainable predictions and straightforward downstream applications. This approach extends previous neural rendering work by accounting for uncertainty in both environmental states and dynamics. We learn the 3D representations with a two-stage training of a Pose-Conditional-VAE and a NeRF. Additionally, we formulate the problem as a partially observable Markov decision process and predict latent scene representations auto-regressively using a mixture density network. We demonstrate the utility of our method in realistic scenarios using the CARLA driving simulator, where CARFF enables efficient trajectory and contingency planning in complex multi-agent autonomous driving scenarios involving visual occlusions.
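
To make the auto-regressive prediction step concrete, the following is a minimal sketch of sampling a sequence of latent scene codes from a mixture density network. It is not the authors' implementation: the function names (`toy_mdn`, `sample_next`, `rollout`), the single-layer randomly initialized head, the latent size, and the number of mixture components are all assumptions chosen for illustration, and NumPy stands in for a trained network.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def toy_mdn(latent, params, n_components):
    """Hypothetical MDN head (an assumption, not the paper's architecture).

    Maps a latent vector to the parameters of a diagonal Gaussian mixture
    over the next latent: mixture weights, means, and standard deviations.
    """
    d = latent.shape[0]
    k = n_components
    h = np.tanh(params["W"] @ latent + params["b"])  # shape (k + 2*k*d,)
    weights = softmax(h[:k])
    means = h[k:k + k * d].reshape(k, d)
    stds = np.exp(h[k + k * d:].reshape(k, d))  # exp keeps std positive
    return weights, means, stds

def sample_next(latent, params, n_components, rng):
    # Pick one mixture component, then sample from its Gaussian.
    weights, means, stds = toy_mdn(latent, params, n_components)
    comp = rng.choice(n_components, p=weights)
    return means[comp] + stds[comp] * rng.standard_normal(latent.shape[0])

def rollout(z0, params, n_components, steps, rng):
    # Auto-regressive rollout: each sampled latent is fed back as input.
    traj = [z0]
    for _ in range(steps):
        traj.append(sample_next(traj[-1], params, n_components, rng))
    return np.stack(traj)

rng = np.random.default_rng(0)
d, k = 4, 3                      # illustrative latent size and component count
out_dim = k + 2 * k * d
params = {
    "W": rng.normal(scale=0.5, size=(out_dim, d)),  # untrained placeholder weights
    "b": np.zeros(out_dim),
}
z0 = rng.standard_normal(d)      # stands in for an encoder's latent sample
trajectory = rollout(z0, params, k, steps=5, rng=rng)
print(trajectory.shape)          # (6, 4): initial latent plus 5 predicted steps
```

In a full system along the lines the abstract describes, each latent in `trajectory` would condition the NeRF to render the corresponding hypothesized scene, and repeating the rollout with different random draws would yield multiple plausible futures for contingency planning.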