Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future
December 18, 2025
Authors: Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, Junwei Liang
cs.AI
Abstract
Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tail scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some of these limitations by learning direct mappings from visual inputs to actions, but they remain opaque and sensitive to distribution shifts, and lack structured reasoning and instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.
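To make the Dual-System paradigm mentioned above concrete, the following is a minimal illustrative sketch, not an implementation from any surveyed method: a slow VLM-based deliberator produces coarse, language-grounded guidance at a low rate, while a fast planner turns that guidance plus fresh observations into low-level controls at every control step. All names here (VLMDeliberator, FastPlanner, Guidance, drive_loop) are hypothetical placeholders, and the controllers are deliberately trivial.

```python
# Illustrative dual-system VLA loop: slow language-grounded reasoning + fast execution.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Guidance:
    """Explicit guidance from the slow system: a maneuver description plus coarse waypoints."""
    maneuver: str                          # e.g. "yield to pedestrian, then turn left"
    waypoints: List[Tuple[float, float]]   # coarse (x, y) targets in the ego frame


class VLMDeliberator:
    """Slow system: would query a vision-language model every few frames (stubbed here)."""
    def reason(self, camera_frames, instruction: str) -> Guidance:
        # A real system would prompt a VLM with images + the instruction; we return a fixed stub.
        return Guidance(maneuver="follow lane", waypoints=[(5.0, 0.0), (10.0, 0.0)])


class FastPlanner:
    """Fast system: converts guidance and the latest ego state into safety-critical controls."""
    def act(self, ego_state: dict, guidance: Guidance) -> Tuple[float, float]:
        # Trivial proportional controller toward the first waypoint (illustrative only).
        tx, ty = guidance.waypoints[0]
        steer = 0.5 * ty                          # steer toward the lateral offset
        accel = 0.2 * (tx - ego_state["speed"])   # crude longitudinal speed tracking
        return steer, accel


def drive_loop(frames, instruction: str = "drive to the intersection and turn left"):
    """Run slow deliberation at 1/10 of the control rate; execute fast at every step."""
    slow, fast = VLMDeliberator(), FastPlanner()
    guidance = None
    for t, frame in enumerate(frames):
        if t % 10 == 0:                           # refresh guidance infrequently
            guidance = slow.reason([frame], instruction)
        ego_state = {"speed": 3.0}                # placeholder for real state estimation
        yield fast.act(ego_state, guidance)
```

The key design choice captured by this sketch is the rate separation: the VLM contributes interpretable, instruction-conditioned guidance without sitting on the latency-critical control path, which is how Dual-System VLA methods keep execution fast while retaining language-level reasoning.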