FlowR2A: Reward-to-Action Distribution Learning for Multimodal Driving Planning

Abstract

Multimodal driving planning faces a long-standing tension between two paradigms. Scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor-based methods generate proposals dynamically yet suffer from sparse supervision limited to a single ground-truth trajectory. In this work, we propose FlowR2A, which resolves this tension by reframing simulation-based rewards from discriminative targets into generative conditions. By learning the reward-conditioned action distribution from dense trajectory-reward pairs with a flow-matching decoder, FlowR2A unifies the dense supervision of scoring-based methods with the proposal generation of anchor-based methods in a single generative model, forcing the model to internalize the correlation between an action and its outcomes in safety, progress, comfort, and rule compliance. To balance hard safety constraints against soft progress objectives, we introduce fine-grained per-timestep reward conditioning and reward noise augmentation. The generative formulation naturally supports controllable test-time sampling via reward guidance and anchored sampling, producing high-quality proposals. FlowR2A achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks, with multimodal proposals of substantially higher quality than prior methods.
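
For readers unfamiliar with flow matching, the reward-conditioned training described above can be sketched with a generic conditional flow-matching loss. This is an illustrative form under stated assumptions, not the objective given in the paper: the linear interpolation path, the Gaussian form of the reward noise, and the symbols a (a trajectory), r (its per-timestep reward vector), c (scene context), \sigma (reward-noise scale), and v_\theta (the decoder's velocity field) are all assumed here for illustration.

```latex
% Illustrative sketch only: a generic conditional flow-matching loss with
% reward conditioning. All symbols are assumptions, not the paper's notation.
\begin{aligned}
x_0 &\sim \mathcal{N}(0, I), \qquad t \sim \mathcal{U}[0, 1], \qquad
x_t = (1 - t)\,x_0 + t\,a, \\
\tilde{r} &= r + \sigma\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \\
\mathcal{L}(\theta) &= \mathbb{E}_{(a, r, c),\,x_0,\,t,\,\epsilon}
  \bigl\| v_\theta(x_t, t, \tilde{r}, c) - (a - x_0) \bigr\|^2 .
\end{aligned}
```

Under these assumptions, the regression target a - x_0 is the time derivative of the interpolation path x_t, and a proposal could be sampled at test time by integrating dx/dt = v_\theta(x, t, r^*, c) from t = 0 to t = 1 for a chosen target reward r^*.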