Evaluating Multiview Object Consistency in Humans and Image Models
September 9, 2024
Authors: Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Nancy Kanwisher, Joshua B. Tenenbaum, Alexei A. Efros
cs.AI
Abstract
We introduce a benchmark to directly evaluate the alignment between human
observers and vision models on a 3D shape inference task. We leverage an
experimental design from the cognitive sciences which requires zero-shot visual
inferences about object shape: given a set of images, participants identify
which contain the same/different objects, despite considerable viewpoint
variation. We draw from a diverse range of images that include common objects
(e.g., chairs) as well as abstract shapes (i.e., procedurally generated
`nonsense' objects). After constructing over 2000 unique image sets, we
administer these tasks to human participants, collecting 35K trials of
behavioral data from over 500 participants. This includes explicit choice
behaviors as well as intermediate measures, such as reaction time and gaze
data. We then evaluate the performance of common vision models (e.g., DINOv2,
MAE, CLIP). We find that humans outperform all models by a wide margin. Using a
multi-scale evaluation approach, we identify underlying similarities and
differences between models and humans: while human-model performance is
correlated, humans allocate more time/processing on challenging trials. All
images, data, and code can be accessed via our project page.
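To make the evaluation pipeline concrete, below is a minimal sketch (not the paper's released code) of how a frozen vision backbone such as DINOv2 could be scored on one of these same/different trials: embed every image in the set and flag the image whose embedding is least similar to the rest. The torch.hub model choice, the preprocessing, and the odd-one-out readout are illustrative assumptions rather than the authors' exact protocol.

```python
# Sketch: zero-shot same/different judgment with a frozen vision model.
# Assumptions: DINOv2 ViT-S/14 via torch.hub, standard ImageNet preprocessing,
# and an odd-one-out readout via cosine similarity of image embeddings.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

@torch.no_grad()
def odd_one_out(image_paths):
    """Return the index of the image judged to show a different object."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in image_paths])
    emb = F.normalize(model(batch), dim=-1)   # unit-norm image embeddings
    sim = emb @ emb.T                         # pairwise cosine similarity
    sim.fill_diagonal_(0.0)
    # The image least similar to all others is the model's "different object" guess.
    return int(sim.sum(dim=1).argmin())

# Usage (hypothetical file names): compare the prediction with the trial's
# ground-truth odd index to score accuracy across the benchmark.
# pred = odd_one_out(["view_a1.png", "view_a2.png", "view_b1.png"])
```

Model accuracy from a readout like this can then be aggregated per trial and compared against human choices, reaction times, and gaze measures.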