On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

Abstract

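The pursuit of autonomous driving technology hinges on the sophisticated integration of perception, decision-making, and control systems. Traditional approaches, both data-driven and rule-based, have been limited by their inability to grasp the nuances of complex driving environments and the intentions of other road users. This has been a significant bottleneck, particularly in developing the common-sense reasoning and nuanced scene understanding needed for safe and reliable autonomous driving. The advent of Visual Language Models (VLMs) opens a new frontier in realizing fully autonomous driving. This report provides a thorough evaluation of the latest state-of-the-art VLM, GPT-4V(ision), and its application to autonomous driving scenarios. We explore the model's ability to understand and reason about driving scenes, make decisions, and ultimately act as a driver. Our tests range from basic scene recognition to complex causal reasoning and real-time decision-making under varying conditions. Our findings show that GPT-4V outperforms existing autonomous driving systems in scene understanding and causal reasoning. It shows the potential to handle out-of-distribution scenarios, recognize intentions, and make informed decisions in real driving contexts. However, challenges remain, particularly in direction discernment, traffic light recognition, vision grounding, and spatial reasoning tasks. These limitations underscore the need for further research and development. The project is available on GitHub: https://github.com/PJLab-ADG/GPT4V-AD-Exploration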
