CaRL: Learning Scalable Planning Policies with Simple Rewards

Abstract

We investigate reinforcement learning (RL) for privileged planning in autonomous driving. State-of-the-art approaches for this task are rule-based, but these methods do not scale to the long tail. RL, on the other hand, is scalable and does not suffer from compounding errors like imitation learning. Contemporary RL approaches for driving use complex shaped rewards that sum multiple individual rewards, e.g., progress, position, or orientation rewards. We show that PPO fails to optimize a popular version of these rewards when the mini-batch size is increased, which limits the scalability of these approaches. Instead, we propose a new reward design based primarily on optimizing a single intuitive reward term: route completion. Infractions are penalized by terminating the episode or multiplicatively reducing route completion. We find that PPO scales well with higher mini-batch sizes when trained with our simple reward, even improving performance. Training with large mini-batch sizes enables efficient scaling via distributed data parallelism. We scale PPO to 300M samples in CARLA and 500M samples in nuPlan with a single 8-GPU node. The resulting model achieves 64 DS on the CARLA longest6 v2 benchmark, outperforming other RL methods with more complex rewards by a large margin. Requiring only minimal adaptations from its use in CARLA, the same method is the best learning-based approach on nuPlan. It scores 91.3 in non-reactive and 90.6 in reactive traffic on the Val14 benchmark while being an order of magnitude faster than prior work.
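The reward structure described in the abstract (route completion as the single reward term, with infractions either ending the episode or scaling the reward down multiplicatively) can be illustrated with a minimal sketch. This is not the authors' implementation. The `StepInfo` container, its field names, and the example penalty factor are hypothetical and chosen only for illustration; the abstract does not specify which infractions terminate the episode or what the penalty factors are.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class StepInfo:
    """Per-step signals from a hypothetical simulator wrapper (names are assumptions)."""
    route_completion_delta: float               # fraction of the route completed during this step
    soft_penalty_factors: Sequence[float] = ()  # one factor in (0, 1] per active soft infraction
    hard_infraction: bool = False               # True if an episode-ending infraction occurred


def simple_reward(info: StepInfo) -> Tuple[float, bool]:
    """Return (reward, terminated) for one environment step."""
    if info.hard_infraction:
        # Terminating the episode forfeits all route completion
        # that could still have been collected afterwards.
        return 0.0, True

    # Soft infractions shrink the route-completion reward multiplicatively
    # instead of adding separate penalty terms to a sum.
    factor = 1.0
    for p in info.soft_penalty_factors:
        factor *= p
    return info.route_completion_delta * factor, False


# Usage: 2% route progress while one soft infraction (assumed factor 0.5) is active.
reward, terminated = simple_reward(StepInfo(0.02, [0.5]))
```

In this sketch the reward never becomes a sum of differently scaled terms: every component either rescales route completion or ends the episode, which is the property the abstract contrasts with summed shaped rewards.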
