FlowR2A: Learning Reward-to-Action Distributions for Multimodal Driving Planning

Abstract

Multimodal driving planning faces a long-standing tension between two paradigms. Scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor-based methods generate proposals dynamically yet receive only sparse supervision from a single ground-truth trajectory. In this work, we propose FlowR2A, which resolves this tension by reframing simulation-based rewards from discriminative targets into generative conditions. FlowR2A uses a flow-matching decoder to learn the reward-conditioned action distribution from dense trajectory-reward pairs. This unifies the dense supervision of scoring-based methods with the proposal generation of anchor-based methods in a single generative model, and forces the model to internalize the correlation between an action and its outcomes in safety, progress, comfort, and rule compliance. To balance hard safety constraints against soft progress objectives, we introduce fine-grained per-timestep reward conditioning and reward noise augmentation. The generative formulation naturally supports controllable test-time sampling via reward guidance and anchored sampling, producing high-quality proposals. FlowR2A achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks, and its multimodal proposals are of substantially higher quality than those of prior methods.
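To make the idea of a reward-conditioned flow-matching objective more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear interpolation path between Gaussian noise and the trajectory (a common flow-matching choice that the abstract does not specify), per-timestep rewards in [0, 1], and Gaussian jitter as the reward noise augmentation. The names `cfm_training_pair`, `euler_sample`, and `reward_noise_std` are hypothetical.

```python
import random


def cfm_training_pair(action, reward, rng, reward_noise_std=0.05):
    """Build one conditional flow-matching training example (illustrative only).

    action: flat list of floats standing in for trajectory waypoints.
    reward: list of per-timestep rewards, assumed to lie in [0, 1].
    Returns (x_t, t, cond, v_target); a network would regress v_target
    from (x_t, t, cond).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # flow time in [0, 1)
    # Assumed linear path: x_t = (1 - t) * noise + t * action.
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    # The velocity of that path is constant in t: action - noise.
    v_target = [a - n for n, a in zip(noise, action)]
    # Assumed form of reward noise augmentation: jitter, then clamp to [0, 1].
    cond = [min(1.0, max(0.0, r + rng.gauss(0.0, reward_noise_std)))
            for r in reward]
    return x_t, t, cond, v_target


def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At test time, the conditioning vector passed to `euler_sample` would be set to the desired (for example, high) rewards, which is one simple way to read "reward guidance"; the method's actual guidance and anchored sampling procedures are not described in the abstract and are not reproduced here.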