

Evaluating Multiview Object Consistency in Humans and Image Models

September 9, 2024
Authors: Tyler Bonnen, Stephanie Fu, Yutong Bai, Thomas O'Connell, Yoni Friedman, Nancy Kanwisher, Joshua B. Tenenbaum, Alexei A. Efros
cs.AI

Abstract

We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences which requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated "nonsense" objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.
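For concreteness, the sketch below illustrates how a frozen vision encoder could be scored zero-shot on trials of this kind. It is a minimal illustration, not the paper's exact protocol: the `embed` callable, the trial format, and the odd-one-out decision rule (the image with the lowest mean cosine similarity to the others is judged to show a different object) are all assumptions made here for demonstration.

```python
# Minimal sketch of a zero-shot same/different (odd-one-out) evaluation with a
# frozen image encoder such as DINOv2, MAE, or CLIP's image tower.
# Assumptions (not from the paper): `embed` maps a [1, C, H, W] tensor to a
# [1, D] feature vector, and each trial is (list_of_images, index_of_odd_image).
import torch
import torch.nn.functional as F


@torch.no_grad()
def pick_odd_one_out(images: list[torch.Tensor], embed) -> int:
    """Return the index of the image judged to show a different object."""
    feats = torch.stack([embed(img.unsqueeze(0)).squeeze(0) for img in images])
    feats = F.normalize(feats, dim=-1)      # unit-norm embeddings
    sims = feats @ feats.T                  # pairwise cosine similarities
    sims.fill_diagonal_(0.0)                # ignore self-similarity
    mean_sim = sims.sum(dim=1) / (len(images) - 1)
    return int(mean_sim.argmin())           # least similar image = "different" object


def accuracy(trials, embed) -> float:
    """trials: iterable of (images, odd_index) pairs; returns fraction correct."""
    hits = [pick_odd_one_out(imgs, odd_index) == odd_index for imgs, odd_index in trials]
    return sum(hits) / len(hits)
```

Under this readout, per-trial model correctness can then be compared against human choice accuracy or reaction time to probe the correlations described above.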
