On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving
November 9, 2023
Authors: Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi
cs.AI
Abstract
The pursuit of autonomous driving technology hinges on the sophisticated
integration of perception, decision-making, and control systems. Traditional
approaches, both data-driven and rule-based, have been hindered by their
inability to grasp the nuance of complex driving environments and the
intentions of other road users. This has been a significant bottleneck,
particularly in the development of common sense reasoning and nuanced scene
understanding necessary for safe and reliable autonomous driving. The advent of
Visual Language Models (VLM) represents a novel frontier in realizing fully
autonomous vehicle driving. This report provides an exhaustive evaluation of
the latest state-of-the-art VLM, GPT-4V(ision), and its application in
autonomous driving scenarios. We explore the model's abilities to understand
and reason about driving scenes, make decisions, and ultimately act in the
capacity of a driver. Our comprehensive tests span from basic scene recognition
to complex causal reasoning and real-time decision-making under varying
conditions. Our findings reveal that GPT-4V demonstrates superior
performance in scene understanding and causal reasoning compared to existing
autonomous systems. It showcases the potential to handle out-of-distribution
scenarios, recognize intentions, and make informed decisions in real driving
contexts. However, challenges remain, particularly in direction discernment,
traffic light recognition, vision grounding, and spatial reasoning tasks. These
limitations underscore the need for further research and development. The
project is available on GitHub for interested parties to access and use:
https://github.com/PJLab-ADG/GPT4V-AD-Exploration