

ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving

May 6, 2026
作者: Huimin Wang, Yue Wang, Bihao Cui, Pengxiang Li, Ben Lu, Mingqian Wang, Tong Wang, Chuan Tang, Teng Zhang, Kun Zhan
cs.AI

Abstract

We introduce ReflectDrive-2, a masked discrete diffusion planner with a separate action expert for autonomous driving that represents plans as discrete trajectory tokens and generates them through parallel masked decoding. This discrete token space enables in-place trajectory revision: AutoEdit rewrites selected tokens using the same model, without requiring an auxiliary refinement network. To train this capability, we use a two-stage procedure. First, we construct structure-aware perturbations of expert trajectories along longitudinal progress and lateral heading directions and supervise the model to recover the original expert trajectory. We then fine-tune the full decision-draft-reflect rollout with reinforcement learning (RL), assigning the terminal driving reward to the final post-edit trajectory and propagating policy-gradient credit through full-rollout transitions. Full-rollout RL proves crucial for coupling drafting and editing: under supervised training alone, inference-time AutoEdit improves PDMS by at most 0.3, whereas RL increases its gain to 1.9. We also co-design an efficient reflective decoding stack for the decision-draft-reflect pipeline, combining shared-prefix KV reuse, Alternating Step Decode, and fused on-device unmasking. On NAVSIM, ReflectDrive-2 achieves 91.0 PDMS with camera-only input and 94.8 PDMS in a best-of-6 oracle setting, while running at 31.8 ms average latency on NVIDIA Thor.
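To make the draft-then-edit idea concrete, here is a minimal, self-contained sketch of parallel masked decoding with an AutoEdit-style in-place revision pass. Everything here is hypothetical illustration, not the paper's implementation: `toy_logits` is a stand-in for the planner network, the confidence-ordered unmasking schedule and the "re-mask the least confident tokens" edit rule are simplifying assumptions, and the token vocabulary is a toy one. The key property it demonstrates is the one the abstract claims: revision reuses the same model that drafted the trajectory, with no auxiliary refinement network.

```python
import numpy as np

MASK = -1    # sentinel value for masked positions
VOCAB = 8    # size of the toy trajectory-token vocabulary
LENGTH = 6   # number of trajectory tokens in a plan

rng = np.random.default_rng(0)

def toy_logits(tokens):
    """Stand-in for the planner: per-position logits over the vocabulary.
    Strongly biases position i toward token i % VOCAB so the example is
    deterministic; a real planner would condition on sensors and context."""
    logits = rng.standard_normal((LENGTH, VOCAB)) * 0.01
    for i in range(LENGTH):
        logits[i, i % VOCAB] += 5.0
    return logits

def parallel_masked_decode(tokens, max_steps=8):
    """Fill masked positions over a few parallel steps, committing the
    most confident positions first (a common masked-diffusion schedule)."""
    tokens = tokens.copy()
    for _ in range(max_steps):
        masked = np.where(tokens == MASK)[0]
        if masked.size == 0:
            break
        logits = toy_logits(tokens)
        conf = logits.max(axis=1)
        # unmask roughly half of the still-masked positions per step
        k = max(1, (masked.size + 1) // 2)
        chosen = masked[np.argsort(-conf[masked])][:k]
        tokens[chosen] = logits[chosen].argmax(axis=1)
    return tokens

def auto_edit(tokens, remask_k=2):
    """In-place revision: re-mask the least confident tokens of the draft
    and redecode them with the *same* model, so no auxiliary refinement
    network is needed."""
    conf = toy_logits(tokens).max(axis=1)
    worst = np.argsort(conf)[:remask_k]
    tokens = tokens.copy()
    tokens[worst] = MASK
    return parallel_masked_decode(tokens)

draft = parallel_masked_decode(np.full(LENGTH, MASK))
final = auto_edit(draft)
```

In the paper's two-stage recipe, a supervised phase would teach `auto_edit` to recover expert trajectories from structure-aware perturbations, and full-rollout RL would then assign the terminal driving reward to `final`, back-propagating credit through both the drafting and editing steps.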