

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

May 6, 2026
作者: Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, Kun Zhan
cs.AI

Abstract

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision-draft-reflect rollout with reinforcement learning (RL), assigning the terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision-draft-reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
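To make the two core mechanisms concrete, here is a minimal sketch of confidence-based parallel masked decoding and of AutoEdit-style in-place revision, which re-masks selected trajectory tokens and redecodes them with the same model. All names (`MASK`, `model_logits`, the halving unmask schedule, vocabulary size) are illustrative assumptions, not details from the paper; `model_logits` is a random stand-in for the actual planner.

```python
import numpy as np

MASK = -1  # hypothetical mask-token id
rng = np.random.default_rng(0)

def model_logits(tokens, vocab=32):
    # Stand-in for the planner network: per-position logits over the
    # trajectory-token vocabulary (random here, purely for illustration).
    return rng.normal(size=(len(tokens), vocab))

def masked_decode(tokens, steps=4):
    """Parallel masked decoding: each step commits the most confident
    masked positions and leaves the rest for later steps."""
    tokens = tokens.copy()
    for _ in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = model_logits(tokens)
        conf = logits.max(axis=1)
        # Unmask roughly half of the remaining masked slots per step
        # (an assumed schedule, for illustration only).
        k = max(1, int(np.ceil(masked.size / 2)))
        pick = masked[np.argsort(-conf[masked])[:k]]
        tokens[pick] = logits[pick].argmax(axis=1)
    return tokens

def auto_edit(tokens, edit_positions):
    """In-place revision: re-mask the selected tokens, then redecode
    them with the same model -- no auxiliary refinement network."""
    tokens = tokens.copy()
    tokens[edit_positions] = MASK
    return masked_decode(tokens)

draft = masked_decode(np.full(8, MASK))   # draft an 8-token trajectory
revised = auto_edit(draft, [2, 5])        # revise two selected tokens
```

The key property this illustrates is that drafting and editing share one model and one decoding loop: editing is just decoding restarted from a partially masked state, which is what lets the RL stage assign a single terminal reward to the post-edit trajectory.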