

A Survey on Vision-Language-Action Models for Autonomous Driving

June 30, 2025
Authors: Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun
cs.AI

Abstract

The rapid progress of multimodal large language models (MLLMs) has paved the way for the Vision-Language-Action (VLA) paradigm, which integrates visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is expanding rapidly. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainers to reasoning-centric VLA models, and (iii) compare over 20 representative models according to the progress of VLA in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges (robustness, real-time efficiency, and formal verification) and outline future directions for VLA4AD. This survey provides a concise yet complete reference for advancing interpretable, socially aligned autonomous vehicles. The GitHub repo is available at https://github.com/JohnsonJiang1996/Awesome-VLA4AD.
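The abstract describes VLA models as coupling visual perception, language understanding, and control within a single policy. As a rough illustration of that three-part structure (not a reproduction of any model in the survey; the module names, dimensions, and late-fusion scheme below are all hypothetical), here is a minimal PyTorch sketch that projects a vision embedding and an instruction embedding into a shared space and decodes a low-level driving action:

```python
# Illustrative sketch only, not a model from the survey: a toy VLA-style
# policy that fuses visual and language features into driving controls.
# All names and dimensions are hypothetical placeholders.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, hidden=256, action_dim=2):
        super().__init__()
        # Vision branch: projects camera features into a shared space.
        self.vision_proj = nn.Linear(vis_dim, hidden)
        # Language branch: projects instruction embeddings into the same space.
        self.text_proj = nn.Linear(txt_dim, hidden)
        # Action head: decodes fused features into low-level controls
        # (e.g., steering angle and throttle).
        self.action_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, vis_feat, txt_feat):
        # Simple late fusion by concatenation; real VLA4AD models typically
        # use far richer cross-attention over token sequences.
        fused = torch.cat([self.vision_proj(vis_feat),
                           self.text_proj(txt_feat)], dim=-1)
        return self.action_head(fused)

# Usage: one camera-frame feature plus one instruction embedding -> action.
policy = ToyVLAPolicy()
action = policy(torch.randn(1, 512), torch.randn(1, 512))
print(action.shape)  # torch.Size([1, 2])
```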