Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks

Abstract

Recent advances in driving world models enable controllable generation of high-quality RGB or multimodal videos. Existing methods focus primarily on metrics for generation quality and controllability. However, they often overlook evaluation on downstream perception tasks, which are crucial to autonomous driving performance. Existing methods usually pretrain on synthetic data and then fine-tune on real data, which takes twice as many epochs as the baseline trained on real data only. When the baseline is also trained for twice as many epochs, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed to enhance downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and then renders 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce edited, multi-view, photorealistic videos, which can be used to train downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner-case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, which covers the typical object categories in driving scenarios and enables diverse 3D-aware video editing. Comprehensive experiments show that Dream4Drive effectively boosts the performance of downstream perception models under various training epoch budgets.

Page: https://wm-research.github.io/Dream4Drive/
GitHub Link: https://github.com/wm-research/Dream4Drive


