Robusto-2: Benchmarking Humans and VLMs for Self-Driving in Lima and New York City

Abstract

As self-driving cars continue to expand internationally and adopt multi-modal systems such as VLMs as the cognitive backbone of their action models, how well will these systems generalize to new settings, in particular to out-of-distribution (OOD) edge-case scenarios in new geographies? In this paper, we study this open question through a full factorial analysis with human drivers from Lima, human drivers from New York City, and VLMs: we show them dashcam footage collected in Lima and New York City and prompt them with a variety of questions under a Visual Question Answering (VQA) paradigm. We pick these two cities because they are highly challenging driving locations where no self-driving car company currently operates, and we ask questions spanning four categories: Factual, Ratings, Counterfactual, and Reasoning. We find that humans and VLMs diverge in their responses, though the divergence is modulated by the type of question asked, and that humans answer similarly regardless of where they are from (Lima or NYC). To our surprise, we did not find a strong difference in answers (from humans or VLMs) that was modulated by geography, likely because the scenarios are highly out-of-distribution. Our dataset is available at: https://huggingface.co/datasets/Artificio/robusto-2
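
To make the full factorial design concrete, the following minimal sketch enumerates its cells, assuming the three factors named in the abstract: respondent group, footage city, and question category. The identifiers and label strings are hypothetical and are not taken from the released dataset.

```python
from itertools import product

# Factor levels as described in the abstract; the label strings are
# illustrative assumptions, not field names from the dataset.
RESPONDENTS = ["human_lima", "human_nyc", "vlm"]
FOOTAGE_CITIES = ["lima", "nyc"]
QUESTION_CATEGORIES = ["factual", "ratings", "counterfactual", "reasoning"]


def factorial_cells():
    """Return every combination of respondent, footage city, and category."""
    return [
        {"respondent": r, "footage_city": c, "category": q}
        for r, c, q in product(RESPONDENTS, FOOTAGE_CITIES, QUESTION_CATEGORIES)
    ]


cells = factorial_cells()
print(len(cells))  # 3 respondent groups * 2 cities * 4 categories = 24
```

Crossing every respondent group with every footage city is what allows an effect of geography to be separated from an effect of respondent type or question category.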