Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving
September 24, 2025
Authors: Pengxiang Li, Yinan Zheng, Yue Wang, Huimin Wang, Hang Zhao, Jingjing Liu, Xianyuan Zhan, Kun Zhan, Xianpeng Lang
cs.AI
Abstract
End-to-End (E2E) solutions have emerged as a mainstream approach for
autonomous driving systems, with Vision-Language-Action (VLA) models
representing a new paradigm that leverages pre-trained multimodal knowledge
from Vision-Language Models (VLMs) to interpret and interact with complex
real-world environments. However, these methods remain constrained by the
limitations of imitation learning, which struggles to inherently encode
physical rules during training. Existing approaches often rely on complex
rule-based post-refinement, employ reinforcement learning that remains largely
limited to simulation, or utilize diffusion guidance that requires
computationally expensive gradient calculations. To address these challenges,
we introduce ReflectDrive, a novel learning-based framework that integrates a
reflection mechanism for safe trajectory generation via discrete diffusion. We
first discretize the two-dimensional driving space to construct an action
codebook, enabling the use of pre-trained Diffusion Language Models for
planning tasks through fine-tuning. Central to our approach is a safety-aware
reflection mechanism that performs iterative self-correction without gradient
computation. Our method begins with goal-conditioned trajectory generation to
model multi-modal driving behaviors. Based on this, we apply local search
methods to identify unsafe tokens and determine feasible solutions, which then
serve as safe anchors for inpainting-based regeneration. Evaluated on the
NAVSIM benchmark, ReflectDrive demonstrates significant advantages in
safety-critical trajectory generation, offering a scalable and reliable
solution for autonomous driving systems.
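
The two gradient-free ingredients named in the abstract, discretizing the two-dimensional driving space into tokens and locally searching for a feasible replacement of an unsafe token, can be illustrated with a small sketch. This is not the authors' implementation: the grid bounds, the bin count, the circular-obstacle safety test, and the names `encode`, `decode`, `is_safe`, and `local_search` are all assumptions made for illustration. The inpainting-based regeneration step requires the fine-tuned Diffusion Language Model and is not shown.

```python
import numpy as np

# Hypothetical grid; the abstract does not state the actual bounds or resolution.
LO = np.array([-10.0, -20.0])   # (x_min, y_min) in metres
HI = np.array([50.0, 20.0])     # (x_max, y_max) in metres
N_BINS = 128                    # bins per axis
WIDTH = (HI - LO) / N_BINS      # bin size per axis


def encode(points):
    """Map metric waypoints of shape (N, 2) to integer bin indices of shape (N, 2)."""
    frac = (np.asarray(points, dtype=float) - LO) / (HI - LO)
    idx = np.floor(frac * N_BINS).astype(int)
    return np.clip(idx, 0, N_BINS - 1)


def decode(tokens):
    """Map bin indices back to the metric coordinates of the bin centres."""
    return LO + (np.asarray(tokens, dtype=float) + 0.5) * WIDTH


def is_safe(point, obstacles, radius):
    """Assumed safety test: the point lies outside every circular obstacle."""
    return all(np.linalg.norm(point - o) > radius for o in obstacles)


def local_search(tokens, obstacles, radius, max_offset=8):
    """Replace each unsafe token by the nearest safe neighbouring bin.

    Returns the repaired tokens and the indices that were changed. In the
    setting the abstract describes, such indices would act as safe anchors
    for regeneration; here they are only reported.
    """
    # Candidate offsets, sorted by squared distance in bin units.
    rng = range(-max_offset, max_offset + 1)
    offsets = sorted(((dx, dy) for dx in rng for dy in rng),
                     key=lambda d: d[0] ** 2 + d[1] ** 2)
    repaired = tokens.copy()
    anchors = []
    for i, tok in enumerate(tokens):
        if is_safe(decode(tok), obstacles, radius):
            continue
        for dx, dy in offsets:
            cand = np.clip(tok + np.array([dx, dy]), 0, N_BINS - 1)
            if is_safe(decode(cand), obstacles, radius):
                repaired[i] = cand
                anchors.append(i)
                break
    return repaired, anchors


# Straight-line trajectory that passes through an obstacle centred at (10, 0).
trajectory = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 0.0], [15.0, 0.0]])
obstacles = np.array([[10.0, 0.0]])
tokens = encode(trajectory)
repaired, anchors = local_search(tokens, obstacles, radius=1.0)
```

In this toy setup only the waypoint inside the obstacle is moved, and no gradient of the safety test is ever needed because the search runs directly over the discrete token grid.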