A Survey on Vision-Language-Action Models for Autonomous Driving

Abstract

The rapid progress of multimodal large language models (MLLMs) has paved the way for Vision-Language-Action (VLA) paradigms, which integrate visual perception, natural language understanding, and control within a single policy. Researchers in autonomous driving are actively adapting these methods to the vehicle domain. Such models promise autonomous vehicles that can interpret high-level instructions, reason about complex traffic scenes, and make their own decisions. However, the literature remains fragmented and is expanding rapidly. This survey offers the first comprehensive overview of VLA for Autonomous Driving (VLA4AD). We (i) formalize the architectural building blocks shared across recent work, (ii) trace the evolution from early explainer models to reasoning-centric VLA models, and (iii) compare over 20 representative models according to the progression of VLA in the autonomous driving domain. We also consolidate existing datasets and benchmarks, highlighting protocols that jointly measure driving safety, accuracy, and explanation quality. Finally, we detail open challenges such as robustness, real-time efficiency, and formal verification, and outline future directions for VLA4AD. This survey provides a concise yet complete reference for advancing interpretable, socially aligned autonomous vehicles. The GitHub repository is available at https://github.com/JohnsonJiang1996/Awesome-VLA4AD (SicongJiang/Awesome-VLA4AD).
