On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
November 9, 2023
Authors: Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi
cs.AI
Abstract
The pursuit of autonomous driving technology hinges on the sophisticated
integration of perception, decision-making, and control systems. Traditional
approaches, both data-driven and rule-based, have been hindered by their
inability to grasp the nuances of complex driving environments and the
intentions of other road users. This has been a significant bottleneck,
particularly in the development of common sense reasoning and nuanced scene
understanding necessary for safe and reliable autonomous driving. The advent of
Visual Language Models (VLMs) represents a novel frontier in realizing fully
autonomous vehicle driving. This report provides an exhaustive evaluation of
the latest state-of-the-art VLM, GPT-4V(ision), and its application in
autonomous driving scenarios. We explore the model's abilities to understand
and reason about driving scenes, make decisions, and ultimately act in the
capacity of a driver. Our comprehensive tests span from basic scene recognition
to complex causal reasoning and real-time decision-making under varying
conditions. Our findings reveal that GPT-4V demonstrates superior
performance in scene understanding and causal reasoning compared to existing
autonomous systems. It showcases the potential to handle out-of-distribution
scenarios, recognize intentions, and make informed decisions in real driving
contexts. However, challenges remain, particularly in direction discernment,
traffic light recognition, vision grounding, and spatial reasoning tasks. These
limitations underscore the need for further research and development. The
project is now available on GitHub for interested parties to access and use:
https://github.com/PJLab-ADG/GPT4V-AD-Exploration
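
To make the evaluated interaction concrete, below is a minimal sketch of how a single driving-scene image can be submitted to GPT-4V for scene description and a driving decision. It assumes the OpenAI Python SDK (v1.x) and the gpt-4-vision-preview model available at the time of writing; the prompt and the file name front_camera.jpg are hypothetical placeholders and do not reproduce the authors' evaluation harness (see the GitHub repository above for their actual prompts and test cases).

# Minimal sketch: query GPT-4V with one driving-scene image.
# Assumes the OpenAI Python SDK v1.x; prompt and image path are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    """Base64-encode a local image so it can be sent inline as a data URL."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_b64 = encode_image("front_camera.jpg")  # hypothetical dashcam frame

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "You are the driver of this vehicle. Describe the scene, "
                        "note traffic lights and other road users' intentions, "
                        "and state your next driving action with a brief reason."
                    ),
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)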