Evaluating Multiview Object Consistency in Humans and Image Models
September 9, 2024
Authors: Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Nancy Kanwisher, Joshua B. Tenenbaum, Alexei A. Efros
cs.AI
Abstract
We introduce a benchmark to directly evaluate the alignment between human
observers and vision models on a 3D shape inference task. We leverage an
experimental design from the cognitive sciences which requires zero-shot visual
inferences about object shape: given a set of images, participants identify
which contain the same/different objects, despite considerable viewpoint
variation. We draw from a diverse range of images that include common objects
(e.g., chairs) as well as abstract shapes (i.e., procedurally generated
`nonsense' objects). After constructing over 2000 unique image sets, we
administer these tasks to human participants, collecting 35K trials of
behavioral data from over 500 participants. This includes explicit choice
behaviors as well as intermediate measures, such as reaction time and gaze
data. We then evaluate the performance of common vision models (e.g., DINOv2,
MAE, CLIP). We find that humans outperform all models by a wide margin. Using a
multi-scale evaluation approach, we identify underlying similarities and
differences between models and humans: while human-model performance is
correlated, humans allocate more time/processing on challenging trials. All
images, data, and code can be accessed via our project page.
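To make the evaluation pipeline concrete, below is a minimal sketch (not the paper's released code) of how a frozen vision backbone such as DINOv2 could be scored on one of these same/different trials: embed every image in the set and flag the image whose embedding is least similar to the rest. The torch.hub model choice, the preprocessing, and the odd-one-out readout are illustrative assumptions rather than the authors' exact protocol.

```python
# Sketch: zero-shot same/different judgment with a frozen vision model.
# Assumptions: DINOv2 ViT-S/14 via torch.hub, standard ImageNet preprocessing,
# and an odd-one-out readout via cosine similarity of image embeddings.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

@torch.no_grad()
def odd_one_out(image_paths):
    """Return the index of the image judged to show a different object."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in image_paths])
    emb = F.normalize(model(batch), dim=-1)   # unit-norm image embeddings
    sim = emb @ emb.T                         # pairwise cosine similarity
    sim.fill_diagonal_(0.0)
    # The image least similar to all others is the model's "different object" guess.
    return int(sim.sum(dim=1).argmin())

# Usage (hypothetical file names): compare the prediction with the trial's
# ground-truth odd index to score accuracy across the benchmark.
# pred = odd_one_out(["view_a1.png", "view_a2.png", "view_b1.png"])
```

Model accuracy from a readout like this can then be aggregated per trial and compared against human choices, reaction times, and gaze measures.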