Fast-dDrive: An Efficient Block-Diffusion VLM for Autonomous Driving

Abstract

End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a delicate balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking N stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we suppress prediction variance at a fraction of the computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves state-of-the-art ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to 0.32 m (a 22% improvement). When integrated with SGLang, our framework delivers a 12× throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
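To make two of the ideas above concrete, the following is a minimal sketch rather than the paper's implementation. It assumes that "bidirectional refinement within semantic units" with "strict causal ordering across them" can be expressed as a block-causal attention mask, and that averaging rollouts means a waypoint-wise mean over N trajectories of equal length. The function names `block_causal_mask` and `average_rollouts`, the block sizes, and the 2D waypoint representation are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a trajectory is a list of (x, y) waypoints.
Point = Tuple[float, float]


def block_causal_mask(block_sizes: Sequence[int]) -> List[List[bool]]:
    """Return mask[i][j], True when token i may attend to token j.

    Tokens in the same block see each other (bidirectional), tokens in
    earlier blocks are visible, and tokens in later blocks are hidden.
    """
    # Map every token position to the index of the block that contains it.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


def average_rollouts(rollouts: Sequence[Sequence[Point]]) -> List[Point]:
    """Waypoint-wise mean of N trajectories that share the same horizon."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    horizon = len(rollouts[0])
    if any(len(r) != horizon for r in rollouts):
        raise ValueError("all rollouts must have the same number of waypoints")
    n = len(rollouts)
    return [
        (
            sum(r[t][0] for r in rollouts) / n,
            sum(r[t][1] for r in rollouts) / n,
        )
        for t in range(horizon)
    ]


if __name__ == "__main__":
    # Two illustrative blocks, e.g. a 2-token perception unit followed by
    # a 3-token planning unit (sizes are made up for the example).
    for row in block_causal_mask([2, 3]):
        print("".join("x" if visible else "." for visible in row))

    # Two made-up rollouts with two waypoints each.
    samples = [[(0.0, 0.0), (1.0, 2.0)], [(2.0, 0.0), (3.0, 4.0)]]
    print(average_rollouts(samples))
```

In this sketch the perception block never attends to the planning block, which is one way to read the perceive-then-plan constraint. For independent rollouts with equal variance, the variance of the mean shrinks in proportion to 1/N; how closely real rollouts forked from a shared prefix satisfy that independence assumption is not something the abstract specifies.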