Robusto-2: Benchmarking Humans and Vision-Language Models for Autonomous Driving in Lima and New York City

Abstract

As self-driving cars expand internationally and adopt multi-modal systems such as vision-language models (VLMs) as the cognitive backbone of their action models, how well will these systems generalize to new settings, in particular to out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis with human drivers from Lima, human drivers from New York City (NYC), and VLMs. We show each group dashcam footage collected in Lima and New York City and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span four categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though the divergence is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima or NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to the highly out-of-distribution nature of the scenarios. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
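
The full factorial design described in the abstract can be pictured as every combination of respondent group, footage city, and question category. The following is a minimal sketch for illustration only: the constant names and the `factorial_cells` helper are assumptions made here, not part of the dataset or the authors' code, and the abstract does not say how many videos or questions fall into each cell.

```python
from itertools import product

# Factor levels as named in the abstract; the variable names are hypothetical.
RESPONDENTS = ("Lima drivers", "NYC drivers", "VLMs")
FOOTAGE_CITIES = ("Lima", "New York City")
QUESTION_CATEGORIES = ("Factual", "Ratings", "Counterfactual", "Reasoning")


def factorial_cells():
    """Return every (respondent, footage city, question category) combination."""
    return list(product(RESPONDENTS, FOOTAGE_CITIES, QUESTION_CATEGORIES))


cells = factorial_cells()
print(len(cells))  # 3 respondent groups * 2 cities * 4 categories = 24
```

Under this reading, a geography effect would show up as a difference between cells that share a respondent group and question category but differ in footage city, which is the comparison the abstract reports as not strongly different.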