RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
April 16, 2026
Authors: Hao Gao, Shaoyu Chen, Yifan Zhu, Yuehao Song, Wenyu Liu, Qian Zhang, Xinggang Wang
cs.AI
Abstract
High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
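The generate-then-rerank idea described in the abstract can be illustrated with a minimal sketch. This is not the RAD-2 implementation: `sample_candidates` is a hypothetical random stand-in for the diffusion-based generator, `toy_score` is a hand-written stand-in for the learned RL-optimized discriminator, and `group_relative_advantages` shows only the standard group-relative normalization used in GRPO-style methods, not the temporally consistent variant the paper proposes. All names, weights, and the one-dimensional speed-profile representation are assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def sample_candidates(num_candidates, horizon, rng):
    """Hypothetical generator stand-in: random speed profiles (m/s) per candidate."""
    return [
        [rng.uniform(0.0, 10.0) for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def toy_score(trajectory, speed_limit=8.0):
    """Hypothetical discriminator stand-in: a hand-written scalar quality score.

    Rewards progress, penalizes exceeding the speed limit and abrupt speed changes.
    The weights are arbitrary illustrative choices.
    """
    progress = sum(trajectory)
    overspeed = sum(max(0.0, v - speed_limit) for v in trajectory)
    roughness = sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))
    return progress - 5.0 * overspeed - 0.5 * roughness


def rerank(candidates, score_fn):
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score_fn, reverse=True)


def group_relative_advantages(rewards, eps=1e-8):
    """Standard group-relative normalization: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = sample_candidates(num_candidates=8, horizon=5, rng=rng)
    ranked = rerank(candidates, toy_score)
    scores = [toy_score(c) for c in ranked]
    print("best score:", scores[0])
    print("advantages:", group_relative_advantages(scores))
```

In this sketch the scorer only ranks a small set of complete candidates, so the scalar signal never has to be propagated through the high-dimensional trajectory output of the generator, which is the decoupling the abstract credits for improved optimization stability.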