On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

Abstract

The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been hindered by their inability to grasp the nuances of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in developing the common-sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving. The advent of Visual Language Models (VLMs) opens a new frontier in realizing fully autonomous driving. This report provides a thorough evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application in autonomous driving scenarios. We explore the model's abilities to understand and reason about driving scenes, make decisions, and ultimately act in the capacity of a driver. Our tests range from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings reveal that GPT-4V demonstrates superior performance in scene understanding and causal reasoning compared to existing autonomous systems. It shows the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. The project is available on GitHub: https://github.com/PJLab-ADG/GPT4V-AD-Exploration
