Robusto-2: Benchmarking Humans & VLMs for Autonomous Driving in Lima & New York City
June 18, 2026
Authors: Adrian Cespedes, Marcelo Chincha, Dunant Cusipuma, Victor Flores-Benites, David Ortega, Arturo Deza
cs.AI
Abstract
As self-driving cars continue to expand internationally and adopt multi-modal systems such as Vision-Language Models (VLMs) as the cognitive backbone of their action models, how well will these systems generalize to new settings, in particular to out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis involving human drivers from Lima, human drivers from New York City, and VLMs. We show each group dashcam footage collected in Lima and New York City and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span four categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though the divergence is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima or NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to the highly out-of-distribution nature of the scenarios. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2