CaRL: Learning Scalable Planning Policies with Simple Rewards

April 24, 2025
Authors: Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, Andreas Geiger
cs.AI

Abstract

We investigate reinforcement learning (RL) for privileged planning in autonomous driving. State-of-the-art approaches for this task are rule-based, but these methods do not scale to the long tail. RL, on the other hand, is scalable and does not suffer from compounding errors like imitation learning. Contemporary RL approaches for driving use complex shaped rewards that sum multiple individual rewards, e.g., progress, position, or orientation rewards. We show that PPO fails to optimize a popular version of these rewards when the mini-batch size is increased, which limits the scalability of these approaches. Instead, we propose a new reward design based primarily on optimizing a single intuitive reward term: route completion. Infractions are penalized by terminating the episode or multiplicatively reducing route completion. We find that PPO scales well with higher mini-batch sizes when trained with our simple reward, even improving performance. Training with large mini-batch sizes enables efficient scaling via distributed data parallelism. We scale PPO to 300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The resulting model achieves 64 DS on the CARLA longest6 v2 benchmark, outperforming other RL methods with more complex rewards by a large margin. Requiring only minimal adaptations from its use in CARLA, the same method is the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and 90.6 in reactive traffic on the Val14 benchmark while being an order of magnitude faster than prior work.
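
The reward design described in the abstract can be illustrated with a minimal sketch. The Python code below is not the authors' implementation; the infraction names, penalty factors, and function signature are assumptions made for illustration. It captures the stated idea: per step, the agent earns the newly completed fraction of the route, soft infractions shrink this term multiplicatively, and hard infractions terminate the episode (forfeiting all future route completion).

```python
# Illustrative sketch (not the paper's code) of a simple route-completion reward
# with multiplicative infraction penalties and episode-terminating infractions.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SimpleRewardConfig:
    # Infractions assumed to end the episode immediately (names are illustrative).
    terminal_infractions: Tuple[str, ...] = ("collision", "off_road")
    # Soft infractions scale the route-completion term multiplicatively;
    # the specific factors here are assumptions, not values from the paper.
    soft_penalties: Dict[str, float] = field(
        default_factory=lambda: {"lane_deviation": 0.5, "too_fast": 0.5}
    )


def step_reward(
    route_completion_delta: float,
    infractions: List[str],
    cfg: SimpleRewardConfig,
) -> Tuple[float, bool]:
    """Return (reward, done) for one environment step.

    The reward is primarily the newly completed fraction of the route;
    soft infractions reduce it multiplicatively, while terminal infractions
    end the episode so no further route completion can be earned.
    """
    done = any(i in cfg.terminal_infractions for i in infractions)
    reward = max(route_completion_delta, 0.0)
    for infraction in infractions:
        reward *= cfg.soft_penalties.get(infraction, 1.0)
    return reward, done


if __name__ == "__main__":
    cfg = SimpleRewardConfig()
    # Agent advanced 0.2% of the route while deviating from its lane.
    print(step_reward(0.002, ["lane_deviation"], cfg))  # (0.001, False)
    # A collision terminates the episode regardless of the progress made.
    print(step_reward(0.002, ["collision"], cfg))       # (0.002, True)
```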