Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving

Abstract

Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, is limited in scale, in the diversity of sensor configurations, and in geographic and long-tail behavioral coverage. In contrast, in-the-wild data from sources such as dashcams offers immense scale and diversity, capturing critical long-tail scenarios and novel environments. However, this unstructured video is incompatible with ADS that expect structured, multi-modal sensor inputs for validation and training. To bridge this data gap, we propose Sensor2Sensor, a novel generative modeling paradigm that translates in-the-wild monocular dashcam videos into a high-fidelity, multi-modal sensor suite (AV logs) comprising multi-view camera images and LiDAR point clouds. A core challenge is the lack of paired training data. We address this by converting real AV logs into dashcam-style videos via 4D Gaussian Splatting (4DGS) reconstruction and novel-view rendering, so that each synthesized dashcam-style video is paired with the real sensor data it was derived from. Sensor2Sensor then uses a diffusion architecture to perform the generative conversion. We perform comprehensive quantitative evaluations of the fidelity and realism of the generated sensor data. We also demonstrate Sensor2Sensor's practical utility by converting challenging in-the-wild internet and dashcam footage into realistic, multi-modal data formats, unlocking vast external data sources for AV development.
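
The pair-construction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names AVLog, TrainingPair, build_pairs, and stub_render_dashcam are hypothetical, and the stub renderer simply reuses the front-camera frames as a stand-in for 4DGS reconstruction and novel-view rendering, which are not implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# All names in this sketch are hypothetical; the abstract specifies no API.
@dataclass
class AVLog:
    """One real AV log: multi-view camera frames plus LiDAR sweeps."""
    camera_frames: Dict[str, List[str]]  # camera name -> frame identifiers
    lidar_sweeps: List[str]              # one point-cloud identifier per timestep


@dataclass
class TrainingPair:
    """Synthesized dashcam-style input paired with the real log it came from."""
    source_dashcam: List[str]  # model input (dashcam-style video)
    target_log: AVLog          # model target (multi-view cameras + LiDAR)


def build_pairs(
    logs: List[AVLog],
    render_dashcam: Callable[[AVLog], List[str]],
) -> List[TrainingPair]:
    """Render each real log as a dashcam-style video and pair it with the log."""
    return [TrainingPair(render_dashcam(log), log) for log in logs]


def stub_render_dashcam(log: AVLog) -> List[str]:
    """Placeholder renderer (assumption): tags the front-camera frames.

    A real pipeline would reconstruct the scene and render it from a
    virtual dashcam viewpoint instead.
    """
    return [f"dashcam_view({frame})" for frame in log.camera_frames["front"]]


logs = [
    AVLog({"front": ["f0", "f1"], "left": ["l0", "l1"]}, ["pc0", "pc1"]),
    AVLog({"front": ["f2"], "left": ["l2"]}, ["pc2"]),
]
pairs = build_pairs(logs, stub_render_dashcam)
```

Passing the renderer as a function keeps the pairing logic independent of how the dashcam-style video is produced, so the placeholder could be swapped for an actual reconstruction-and-rendering step without changing build_pairs.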