FlowR2A: Learning Reward-Action Distributions for Multimodal Driving Planning

Abstract

Multimodal driving planning faces a long-standing tension between two paradigms: scoring-based methods benefit from dense reward supervision but are confined to a fixed action vocabulary, while anchor-based methods generate proposals dynamically yet suffer from sparse supervision constrained to a single ground-truth trajectory. In this work, we propose FlowR2A, which resolves this tension by reframing simulation-based rewards from discriminative targets into generative conditions. By learning the reward-conditioned action distribution from dense trajectory-reward pairs with a flow-matching decoder, FlowR2A unifies the dense supervision of scoring-based methods with the proposal generation of anchor-based methods in a single generative model, forcing the model to internalize the correlation between an action and its outcomes in safety, progress, comfort, and rule compliance. To balance hard safety constraints against soft progress objectives, we introduce fine-grained per-timestep reward conditioning and reward noise augmentation. The generative formulation naturally supports controllable test-time sampling via reward guidance and anchored sampling, producing high-quality proposals. FlowR2A achieves state-of-the-art results on the NAVSIM v1 and v2 benchmarks, with multimodal proposals of substantially higher quality than prior methods.
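The training and sampling idea described above can be illustrated with a minimal sketch of reward-conditioned flow matching. This is not the authors' implementation: the linear interpolation path, the Gaussian reward noise clipped to [0, 1], the array shapes, and every name here (`flow_matching_batch`, `flow_matching_loss`, `euler_sample`, `reward_noise_std`) are assumptions made for illustration. The network that predicts the velocity is left abstract as `velocity_fn`.

```python
import numpy as np


def flow_matching_batch(actions, rewards, rng, reward_noise_std=0.05):
    """Build one conditional flow-matching training batch (illustrative only).

    actions: (B, T, D) trajectories.
    rewards: (B, T, R) per-timestep rewards, assumed to lie in [0, 1].
    Returns the interpolated point x_t, the time t, the noise-augmented
    reward condition, and the regression target for the velocity.
    """
    batch = actions.shape[0]
    x0 = rng.standard_normal(actions.shape)      # noise endpoint of the path
    t = rng.uniform(size=(batch, 1, 1))          # one flow time per trajectory
    x_t = (1.0 - t) * x0 + t * actions           # assumed linear path
    target_velocity = actions - x0               # time derivative of that path
    # Reward noise augmentation (assumed form): jitter, then clip to [0, 1].
    noisy = rewards + reward_noise_std * rng.standard_normal(rewards.shape)
    cond = np.clip(noisy, 0.0, 1.0)
    return x_t, t, cond, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def euler_sample(velocity_fn, cond, shape, rng, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

At test time, a desired reward profile (for example, high values on every component at every timestep) would be passed as `cond`, so that sampling is steered toward trajectories associated with those outcomes. How FlowR2A actually implements reward guidance and anchored sampling is not specified in the abstract and is not modeled here.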