AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning

Abstract

OpenAI o1 and DeepSeek R1 achieve or even surpass human expert-level performance in complex domains such as mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tail problems because of their limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data and do not explore training strategies or optimizations tailored specifically to planning. In this paper, we propose AlphaDrive, an RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored for planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared with using SFT alone or training without reasoning. Moreover, we find that, after RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which are critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first work to integrate GRPO-based RL with planning reasoning in autonomous driving. Code will be released to facilitate future research.
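
The abstract names GRPO but does not describe its update. As a rough illustration only, the group-relative advantage at the core of GRPO can be sketched as follows: several candidate outputs are sampled for the same input, each is scored, and every score is normalized against the mean and standard deviation of its own group. The function name, the example reward values, and the use of the population standard deviation are assumptions made for this sketch; they are not taken from AlphaDrive, and the four planning rewards themselves are not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per sampled output for a single input.
    `eps` avoids division by zero when every sample gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for four plans sampled for the same driving scene.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
```

Outputs that score above their group's average receive a positive advantage and those below it a negative one, so no separate learned value model is needed to form the baseline.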