Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

May 25, 2026
Authors: Kewei Zhang, Jin Wang, Sensen Gao, Chengyue Wu, Yulong Cao, Songyang Han, Boris Ivanovic, Langechuan Liu, Marco Pavone, Song Han, Daquan Zhou, Enze Xie
cs.AI

Abstract

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fraction of the computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves state-of-the-art (SOTA) ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces the average L2 error to 0.32 m (a 22% improvement). When integrated with SGLang, our framework delivers a 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
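The attention pattern described in the abstract (bidirectional within a semantic unit, strictly causal across units) can be illustrated with a small mask builder. This is a minimal sketch of a generic block-causal mask, not the authors' implementation: the function name `block_causal_mask`, the block sizes, and the "perception"/"planning" labels are assumptions chosen for illustration.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Hypothetical helper: block_sizes lists the token count of each semantic
    unit in output order (an assumed layout, not taken from the paper).
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # A token sees its whole block (bidirectional) and all earlier blocks (causal),
    # but never a later block.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Assumed example: a 2-token "perception" block, then a 3-token "planning" block.
    mask = block_causal_mask([2, 3])
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In this assumed layout, the two perception tokens attend only to each other, while each planning token attends to both perception tokens and to all three planning tokens. Because earlier blocks never depend on later ones, their keys and values can in principle be cached once a block is finalized, which full-sequence bidirectional attention does not allow.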