CaRL: Learning Scalable Planning Policies with Simple Rewards
April 24, 2025
作者: Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, Andreas Geiger
cs.AI
Abstract
We investigate reinforcement learning (RL) for privileged planning in
autonomous driving. State-of-the-art approaches for this task are rule-based,
but these methods do not scale to the long tail. RL, on the other hand, is
scalable and does not suffer from compounding errors like imitation learning.
Contemporary RL approaches for driving use complex shaped rewards that sum
multiple individual rewards, e.g., progress, position, or orientation rewards. We
show that PPO fails to optimize a popular version of these rewards when the
mini-batch size is increased, which limits the scalability of these approaches.
Instead, we propose a new reward design based primarily on optimizing a single
intuitive reward term: route completion. Infractions are penalized by
terminating the episode or multiplicatively reducing route completion. We find
that PPO scales well with higher mini-batch sizes when trained with our simple
reward, even improving performance. Training with large mini-batch sizes
enables efficient scaling via distributed data parallelism. We scale PPO to
300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The
resulting model achieves 64 DS on the CARLA longest6 v2 benchmark,
outperforming other RL methods with more complex rewards by a large margin.
Requiring only minimal adaptations from its use in CARLA, the same method is
the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and
90.6 in reactive traffic on the Val14 benchmark while being an order of
magnitude faster than prior work.
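To make the reward description concrete, below is a minimal sketch of a single route-completion reward in which infractions either terminate the episode or multiplicatively reduce the per-step reward. All names, the split into "soft" and "hard" infractions, and the specific penalty factor are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (hypothetical names) of a simple route-completion reward:
# the agent is rewarded only for newly completed route fraction; minor
# infractions shrink the reward multiplicatively, severe infractions end
# the episode.

def step_reward(route_completion, prev_route_completion,
                soft_infraction_factor=1.0, hard_infraction=False):
    """Compute the reward for one environment step.

    route_completion / prev_route_completion: fraction of the route
        completed after / before this step, in [0, 1].
    soft_infraction_factor: multiplicative penalty in (0, 1] for minor
        violations (e.g. briefly leaving the lane); 1.0 if none occurred.
    hard_infraction: True for severe violations (e.g. a collision),
        which terminate the episode.
    """
    progress = max(route_completion - prev_route_completion, 0.0)
    reward = progress * soft_infraction_factor
    terminated = hard_infraction
    return reward, terminated
```

As an illustrative example under these assumptions, completing 2% of the route in a step while incurring a soft infraction with factor 0.5 would yield a reward of 0.01, whereas a collision would end the episode.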