Robusto-2: Benchmarking Humans and Vision-Language Models for Self-Driving in Lima and New York City

Abstract

As self-driving cars continue to expand internationally and use multi-modal systems such as Vision-Language Models (VLMs) as a cognitive backbone for their action models, how well will these systems generalize to new settings, in particular to out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis involving human drivers from Lima, human drivers from New York City, and VLMs. We show them dashcam footage collected in Lima and New York City and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span four categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though this divergence is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima or NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely due to the highly out-of-distribution nature of the scenarios. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2