RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Abstract

High-level autonomous driving requires motion planners capable of modeling multimodal future uncertainties while remaining robust in closed-loop interactions. Although diffusion-based planners are effective at modeling complex trajectory distributions, they often suffer from stochastic instabilities and the lack of corrective negative feedback when trained purely with imitation learning. To address these issues, we propose RAD-2, a unified generator-discriminator framework for closed-loop planning. Specifically, a diffusion-based generator is used to produce diverse trajectory candidates, while an RL-optimized discriminator reranks these candidates according to their long-term driving quality. This decoupled design avoids directly applying sparse scalar rewards to the full high-dimensional trajectory space, thereby improving optimization stability. To further enhance reinforcement learning, we introduce Temporally Consistent Group Relative Policy Optimization, which exploits temporal coherence to alleviate the credit assignment problem. In addition, we propose On-policy Generator Optimization, which converts closed-loop feedback into structured longitudinal optimization signals and progressively shifts the generator toward high-reward trajectory manifolds. To support efficient large-scale training, we introduce BEV-Warp, a high-throughput simulation environment that performs closed-loop evaluation directly in Bird's-Eye View feature space via spatial warping. RAD-2 reduces the collision rate by 56% compared with strong diffusion-based planners. Real-world deployment further demonstrates improved perceived safety and driving smoothness in complex urban traffic.
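
The generate-then-rerank design described in the abstract can be illustrated with a minimal sketch. It is not the RAD-2 implementation: `generate_candidates` and `score` are hypothetical stand-ins for the diffusion-based generator and the RL-optimized discriminator, and `group_relative_advantages` shows only the standard group-relative normalization used in GRPO-style methods, not the temporally consistent variant the paper introduces.

```python
import random
import statistics


def generate_candidates(num_candidates, horizon, rng):
    """Hypothetical stand-in for the generator: random speed profiles (m/s)."""
    return [
        [rng.uniform(0.0, 10.0) for _ in range(horizon)]
        for _ in range(num_candidates)
    ]


def score(trajectory):
    """Hypothetical stand-in for the discriminator: prefers smooth profiles.

    Returns the negated sum of absolute speed changes between steps.
    """
    return -sum(abs(b - a) for a, b in zip(trajectory, trajectory[1:]))


def rerank(candidates, scorer):
    """Sort candidates from highest to lowest score."""
    return sorted(candidates, key=scorer, reverse=True)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rng = random.Random(0)
candidates = generate_candidates(num_candidates=8, horizon=5, rng=rng)
ranked = rerank(candidates, score)
best = ranked[0]  # the trajectory the planner would execute
advantages = group_relative_advantages([score(c) for c in candidates])
```

Because the scorer only ranks a small set of sampled candidates, the scalar reward never has to be propagated through the full trajectory space, which is the stability argument the abstract makes for decoupling the two components.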
