

Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru

March 10, 2025
作者: Dunant Cusipuma, David Ortega, Victor Flores-Benites, Arturo Deza
cs.AI

Abstract

As multimodal foundational models begin to be deployed experimentally in self-driving cars, a reasonable question to ask is how similarly to humans these systems respond in certain driving situations, especially those that are out-of-distribution. To study this, we create the Robusto-1 dataset, which uses dashcam video data from Peru, a country with some of the most aggressive drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well Foundational Visual Language Models (VLMs) compare to humans in driving, we move away from bounding boxes, segmentation maps, occupancy maps, and trajectory estimation toward multimodal Visual Question Answering (VQA), comparing humans and machines through a popular method from systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the types of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of question posed to each type of system (humans vs. VLMs), highlighting a gap in their alignment.
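The comparison described in the abstract hinges on RSA: each system's answers to the same set of VQA items are converted into a representational dissimilarity matrix (RDM), and the two RDMs are then correlated. Below is a minimal sketch of that pipeline in Python, assuming answers have already been embedded as fixed-length vectors; the function names, array shapes, and choice of correlation distance are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(embeddings: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise correlation
    distance between the embeddings of each VQA item's answer.
    Returns the condensed (upper-triangle) form."""
    return pdist(embeddings, metric="correlation")

def rsa_score(human_embeddings: np.ndarray, vlm_embeddings: np.ndarray) -> float:
    """Spearman correlation between the two RDMs -- the standard
    RSA similarity score used in systems neuroscience."""
    rho, _ = spearmanr(rdm(human_embeddings), rdm(vlm_embeddings))
    return rho

# Hypothetical usage: 50 VQA items, answers embedded into 384-d vectors
# (e.g., with a sentence encoder; the encoder choice is an assumption).
rng = np.random.default_rng(0)
human_emb = rng.normal(size=(50, 384))
vlm_emb = rng.normal(size=(50, 384))
print(f"RSA (Spearman rho): {rsa_score(human_emb, vlm_emb):.3f}")
```

A high Spearman rho would indicate that items the humans treat as similar are also treated as similar by the VLM, which is the sense of "cognitive alignment" the paper probes.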

