Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Abstract

End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
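
The abstract does not specify how the two-dimensional driving space is discretized or how the local search operates, so the following is only a minimal sketch under stated assumptions. It uses a uniform grid as a stand-in codebook and a brute-force neighborhood scan as a stand-in local search. The grid bounds, the 0.5 m resolution, the `unsafe` occupancy mask, and the function names `encode`, `decode`, and `nearest_safe_token` are all hypothetical and are not taken from ReflectDrive.

```python
import numpy as np

# Hypothetical grid parameters (assumptions, not from the paper):
# a 2D region in the ego frame split into uniform 0.5 m cells.
X_MIN, X_MAX = -10.0, 70.0   # metres, longitudinal
Y_MIN, Y_MAX = -20.0, 20.0   # metres, lateral
NX, NY = 160, 80             # number of bins per axis


def encode(points):
    """Map (N, 2) waypoints in metres to (N,) integer token ids."""
    pts = np.asarray(points, dtype=float)
    ix = np.floor((pts[:, 0] - X_MIN) / (X_MAX - X_MIN) * NX).astype(int)
    iy = np.floor((pts[:, 1] - Y_MIN) / (Y_MAX - Y_MIN) * NY).astype(int)
    # Points outside the grid are clamped to the border cells.
    ix = np.clip(ix, 0, NX - 1)
    iy = np.clip(iy, 0, NY - 1)
    return ix * NY + iy


def decode(tokens):
    """Map (N,) token ids back to (N, 2) cell-centre coordinates."""
    ix, iy = np.divmod(np.asarray(tokens), NY)
    x = X_MIN + (ix + 0.5) * (X_MAX - X_MIN) / NX
    y = Y_MIN + (iy + 0.5) * (Y_MAX - Y_MIN) / NY
    return np.stack([x, y], axis=1)


def nearest_safe_token(token, unsafe, radius=3):
    """Toy local search: return the closest cell within `radius` cells
    that is not marked in the (NX, NY) boolean mask `unsafe`.
    Falls back to the original token if no safe cell is found."""
    ix, iy = divmod(int(token), NY)
    best, best_d = int(token), None
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            jx, jy = ix + dx, iy + dy
            if 0 <= jx < NX and 0 <= jy < NY and not unsafe[jx, jy]:
                d = dx * dx + dy * dy
                if best_d is None or d < best_d:
                    best, best_d = jx * NY + jy, d
    return best
```

In a reflection loop of the kind the abstract describes, tokens replaced this way could be held fixed as anchors while the remaining trajectory tokens are masked and regenerated by the diffusion model; that regeneration step is not shown here because it depends on model details the abstract does not give.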
