ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

Abstract

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving. It represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along the longitudinal progress and lateral heading directions, and supervise the model to recover the original expert trajectory. We then fine-tune the full decision-draft-reflect rollout with reinforcement learning (RL), assigning a terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through the full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas with RL the gain rises to 1.9. We also co-design an efficient reflective decoding stack for the decision-draft-reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
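To make the first training stage more concrete, the sketch below shows one way a structure-aware perturbation could be built. It is an illustration under stated assumptions, not the authors' procedure: the abstract does not give the exact perturbation form. The function name `perturb_trajectory`, the parameters `speed_scale` and `lateral_shift`, and the choice of a linearly growing lateral offset are all hypothetical. Waypoints are assumed to be (x, y) pairs in the ego frame, starting from the origin.

```python
import math


def perturb_trajectory(waypoints, speed_scale=1.0, lateral_shift=0.0):
    """Return a perturbed copy of `waypoints`, a list of (x, y) ego-frame points.

    Hypothetical illustration only:
    - longitudinal: every segment is rescaled by `speed_scale`, so the
      trajectory makes more or less progress along the same path shape;
    - lateral: each point is pushed along the left normal of its segment,
      with an offset that grows linearly to `lateral_shift` at the horizon.
    """
    perturbed = []
    prev_x, prev_y = 0.0, 0.0   # assumed start of the expert trajectory
    acc_x, acc_y = 0.0, 0.0     # accumulated, longitudinally rescaled position
    n = len(waypoints)
    for i, (x, y) in enumerate(waypoints, start=1):
        dx, dy = x - prev_x, y - prev_y
        seg_len = math.hypot(dx, dy)
        if seg_len > 0.0:
            tx, ty = dx / seg_len, dy / seg_len   # unit heading of this segment
        else:
            tx, ty = 1.0, 0.0                     # stationary step: assume facing +x
        acc_x += speed_scale * dx
        acc_y += speed_scale * dy
        offset = lateral_shift * i / n
        # The left normal of heading (tx, ty) is (-ty, tx).
        perturbed.append((acc_x - ty * offset, acc_y + tx * offset))
        prev_x, prev_y = x, y
    return perturbed


# A straight expert trajectory, slowed down and drifted to the left.
expert = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
noisy = perturb_trajectory(expert, speed_scale=0.5, lateral_shift=0.3)
# A supervised pair would then be (noisy, expert): the model sees the
# perturbed plan and is trained to recover the expert one.
```

In a pipeline like the one described, both trajectories would additionally be quantized into discrete trajectory tokens before training. That step is omitted here because the abstract does not specify the tokenizer.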
