

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

December 18, 2025
Authors: Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, Xiaoshuai Hao, Linfeng Li, Hang Song, Xiangtai Li, Jun Ma, Shaojie Shen, Jianke Zhu, Dacheng Tao, Ziwei Liu, Junwei Liang
cs.AI

Abstract

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some limitations by learning direct mappings from visual inputs to actions, but they remain opaque, sensitive to distribution shifts, and lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.
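To make the dual-system split concrete, here is a minimal sketch of the pattern the abstract describes: a slow, deliberative vision-language model produces language-grounded guidance at a low rate, while a fast planner turns the latest guidance into safety-critical controls on every tick. This is an illustrative skeleton only, not code from the surveyed paper; all names (`SlowVLM`, `FastPlanner`, `Guidance`) and the specific rates and gains are hypothetical assumptions.

```python
"""Illustrative Dual-System VLA skeleton (hypothetical, not from the paper)."""
from dataclasses import dataclass


@dataclass
class Guidance:
    maneuver: str        # textual guidance, e.g. "yield, then turn left"
    target_speed: float  # numerical guidance for the planner, in m/s


class SlowVLM:
    """System 2: deliberate, language-grounded reasoning (low rate, e.g. ~1 Hz)."""

    def reason(self, camera_frames, instruction: str) -> Guidance:
        # A real system would query a multimodal LLM here; we return a stub.
        return Guidance(maneuver="follow lane", target_speed=8.0)


class FastPlanner:
    """System 1: reactive, safety-critical control (high rate, e.g. ~20 Hz)."""

    def plan(self, ego_state: dict, guidance: Guidance) -> dict:
        # Track the VLM's target speed with a simple proportional law,
        # clamped to plausible comfort/braking limits.
        speed_error = guidance.target_speed - ego_state["speed"]
        accel = max(-3.0, min(2.0, 0.5 * speed_error))
        return {"accel": accel, "steer": 0.0}


def control_loop(vlm: SlowVLM, planner: FastPlanner, ticks: int = 100) -> dict:
    guidance = Guidance(maneuver="hold", target_speed=0.0)
    ego = {"speed": 0.0}
    for t in range(ticks):
        # The slow system refreshes guidance 20x less often than the planner runs.
        if t % 20 == 0:
            guidance = vlm.reason(camera_frames=None, instruction="drive safely")
        cmd = planner.plan(ego, guidance)
        ego["speed"] += cmd["accel"] * 0.05  # integrate over a 50 ms tick
    return ego


if __name__ == "__main__":
    print(control_loop(SlowVLM(), FastPlanner()))
```

The 1:20 rate split reflects the design choice the abstract attributes to dual-system methods: deliberation may lag behind the vehicle's state, but the fast execution path must never wait on it. The `Guidance` record also illustrates the textual vs. numerical action-generator distinction, carrying both a maneuver description and a numeric target.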