Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Abstract

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a delicate balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we suppress prediction variance at a fraction of the computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves state-of-the-art ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to 0.32 m (a 22% improvement). When integrated with SGLang, our framework delivers a 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
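Two of the ideas above can be illustrated with a toy sketch: the attention pattern implied by "bidirectional within semantic units, causal across them", and the averaging of N rollouts. This is not the Fast-dDrive implementation. The function names (`block_causal_mask`, `average_rollouts`, `ade`), the list-of-waypoints trajectory format, and the use of plain Python lists in place of model tensors and a real KV cache are all assumptions made for illustration.

```python
from math import hypot
from statistics import fmean


def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed pattern: a token sees every token in its own block
    (bidirectional inside a block) and every token in earlier blocks
    (causal across blocks), but nothing in later blocks.
    """
    # Map each token position to the index of the block it belongs to.
    block_id = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_id)
    return [[block_id[j] <= block_id[i] for j in range(n)] for i in range(n)]


def average_rollouts(rollouts):
    """Average N sampled trajectories waypoint by waypoint.

    `rollouts` is a list of N trajectories, each a list of (x, y) tuples
    of equal length. In a real system the N samples would share one
    prefix KV cache; here they are simply given as data.
    """
    horizon = len(rollouts[0])
    if any(len(r) != horizon for r in rollouts):
        raise ValueError("all rollouts must have the same number of waypoints")
    return [
        (fmean(r[t][0] for r in rollouts), fmean(r[t][1] for r in rollouts))
        for t in range(horizon)
    ]


def ade(pred, target):
    """Average displacement error: mean Euclidean distance per waypoint."""
    return fmean(hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, target))


# Hypothetical example: two blocks (2 and 3 tokens) and two noisy rollouts.
mask = block_causal_mask([2, 3])
rollouts = [[(0.0, 0.0), (1.0, 0.2)], [(0.0, 0.0), (1.0, -0.2)]]
target = [(0.0, 0.0), (1.0, 0.0)]
averaged = average_rollouts(rollouts)
```

In this made-up example the two rollouts deviate from the target in opposite directions, so their average has a lower ADE than either rollout alone. That is the variance-suppression effect the abstract describes, shown on invented data rather than model outputs.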