Fast-dDrive: Efficient Block-Diffusion Vision-Language Models for Autonomous Driving

Abstract

End-to-end autonomous driving with Vision-Language-Action (VLA) models demands a delicate balance between high-fidelity trajectory planning and efficient inference. Existing paradigms fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the perceive-then-plan causal order. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we suppress prediction variance at little additional computational cost. Empirical results show that Fast-dDrive advances the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves state-of-the-art ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to 0.32 m (a 22% improvement). When integrated with SGLang, our framework delivers a 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
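
The abstract describes attention that is bidirectional within a semantic unit and strictly causal across units, but it does not say how this is implemented. One common way to express such a constraint is a block-causal attention mask. The sketch below is a minimal illustration under that assumption and is not the authors' code; the function name `block_causal_mask` and the example block sizes are hypothetical.

```python
from typing import List


def block_causal_mask(block_sizes: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Tokens attend bidirectionally inside their own block and to every
    earlier block, but never to a later block.
    """
    # Map each token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # Attention is allowed when the key's block is not later than the query's.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical layout: a 2-token perception block, then a 3-token planning block.
mask = block_causal_mask([2, 3])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In this toy layout the two perception tokens see only each other, while the three planning tokens see the perception block and one another. Because earlier blocks never attend to later ones, their keys and values stay fixed once computed, which is the property that lets a prefix KV cache be reused across blocks.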