Robusto-2: Benchmarking Humans and VLMs for Autonomous Driving in Lima and New York City

Abstract

As self-driving cars continue to expand internationally and adopt multi-modal systems such as VLMs as the cognitive backbone for their action models, how well will these systems generalize to new settings, in particular to out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis: we show dashcam footage collected in Lima and New York City to human drivers from Lima, human drivers from New York City, and VLMs, and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions that span four categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though the degree of divergence depends on the type of question asked, and that humans answer similarly regardless of where they are from (Lima/NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely because the scenes are highly out-of-distribution. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
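The full factorial design described above can be sketched by enumerating its conditions. This is only an illustrative sketch: the names `RESPONDENT_GROUPS`, `FOOTAGE_CITIES`, `QUESTION_CATEGORIES`, and `design_cells` are hypothetical and do not come from the dataset, and treating the question category as a third crossed factor is an assumption rather than something the abstract states.

```python
from itertools import product

# Factor levels taken from the abstract; the variable names are hypothetical.
RESPONDENT_GROUPS = ["Lima drivers", "NYC drivers", "VLMs"]
FOOTAGE_CITIES = ["Lima", "New York City"]
QUESTION_CATEGORIES = ["Factual", "Ratings", "Counterfactual", "Reasoning"]


def design_cells(include_categories=True):
    """Return every combination of factor levels as a list of tuples.

    With include_categories=False, only respondent group x footage city is
    crossed. Crossing with question category as well is an assumption made
    for illustration.
    """
    factors = [RESPONDENT_GROUPS, FOOTAGE_CITIES]
    if include_categories:
        factors.append(QUESTION_CATEGORIES)
    return list(product(*factors))


if __name__ == "__main__":
    for cell in design_cells():
        print(" | ".join(cell))
```

Under these assumptions, crossing respondent group with footage city gives 3 x 2 = 6 cells, and adding the four question categories gives 24.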