RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
April 16, 2026
Authors: Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang
cs.AI
Abstract
High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
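The generate-then-rerank idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: `sample_candidates` stands in for the diffusion-based generator with random 1-D speed profiles, `score` is a hand-written heuristic in place of the RL-optimized discriminator, and `group_relative_advantages` shows only the plain group-wise reward normalization common to GRPO-style methods. The temporally consistent variant is not specified in the abstract and is not reproduced.

```python
import random
from statistics import mean, pstdev


def sample_candidates(num_candidates, horizon, rng):
    """Hypothetical stand-in for the generator: random 1-D speed profiles (m/s)."""
    return [[rng.uniform(0.0, 10.0) for _ in range(horizon)]
            for _ in range(num_candidates)]


def score(trajectory, target_speed=5.0):
    """Hypothetical stand-in for the discriminator: higher means better.

    Rewards staying near an assumed target speed and penalizes abrupt changes.
    """
    tracking = -mean(abs(v - target_speed) for v in trajectory)
    smoothness = -mean(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    return tracking + 0.5 * smoothness


def rerank(candidates, scorer):
    """Return candidate indices sorted from best to worst, plus the raw scores."""
    scores = [scorer(c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


def group_relative_advantages(rewards, eps=1e-8):
    """Plain group-relative normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rng = random.Random(0)
candidates = sample_candidates(num_candidates=8, horizon=6, rng=rng)
order, scores = rerank(candidates, score)
best_trajectory = candidates[order[0]]  # the candidate the planner would execute

# Made-up closed-loop rewards for one group of rollouts (illustrative only).
rollout_rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rollout_rewards)
```

In this sketch the scalar reward never touches the trajectory space directly: it only shapes the scorer's training signal, while the generator keeps proposing diverse candidates, which is the decoupling the abstract describes.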