AlphaDrive: Unleashing the Potential of Vision-Language Models in Autonomous Driving via Reinforcement Learning and Reasoning

Abstract

OpenAI o1 and DeepSeek R1 match or even surpass human expert-level performance in complex domains such as mathematics and science, with reinforcement learning (RL) and reasoning playing a crucial role. In autonomous driving, recent end-to-end models have greatly improved planning performance but still struggle with long-tail problems because of their limited common sense and reasoning abilities. Some studies integrate vision-language models (VLMs) into autonomous driving, but they typically rely on pre-trained models with simple supervised fine-tuning (SFT) on driving data and do not explore training strategies or optimizations tailored specifically to planning. In this paper, we propose AlphaDrive, an RL and reasoning framework for VLMs in autonomous driving. AlphaDrive introduces four GRPO-based RL rewards tailored to planning and employs a two-stage planning reasoning training strategy that combines SFT with RL. As a result, AlphaDrive significantly improves both planning performance and training efficiency compared with SFT-only training or training without reasoning. Moreover, we are excited to find that, after RL training, AlphaDrive exhibits some emergent multimodal planning capabilities, which are critical for improving driving safety and efficiency. To the best of our knowledge, AlphaDrive is the first method to integrate GRPO-based RL with planning reasoning into autonomous driving. Code will be released to facilitate future research.
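
For readers unfamiliar with GRPO (Group Relative Policy Optimization), the sketch below illustrates its central step: several responses are sampled for the same input, each is scored, and every score is normalized against the group's own mean and standard deviation to obtain an advantage. This is a minimal illustration of the general technique, not AlphaDrive's implementation. The function name, the example reward values, the use of the population standard deviation, and the eps constant are all assumptions made here; the abstract does not specify how the four planning rewards are defined or combined.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per sampled response in a single group.

    Each reward is centred on the group mean and scaled by the group
    standard deviation, so no learned value function is required.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every response scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical total rewards for four plans sampled from one driving scene.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's average receive positive advantages and are reinforced, while below-average ones are discouraged. When every response in a group scores the same, all advantages are zero and that group contributes no learning signal.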