Evaluating Multiview Object Consistency in Humans and Image Models

Abstract

We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences that requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same or different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated "nonsense" objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human and model performance are correlated, humans allocate more time and processing to challenging trials. All images, data, and code can be accessed via our project page.
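To make the same/different task concrete, the following is a minimal sketch of one way a model's choice could be read out on a single trial. The abstract does not say how model responses are obtained, so everything here is an assumption for illustration: the function name `odd_one_out`, the cosine-similarity readout, and the toy 3-dimensional vectors standing in for image embeddings from a model such as DINOv2 or CLIP.

```python
import numpy as np


def odd_one_out(embeddings: np.ndarray) -> int:
    """Return the index of the image least similar to the others in the set.

    `embeddings` has shape (n_images, dim), one row per image.
    Hypothetical readout: this is not taken from the benchmark's code.
    """
    # Normalize each row so that dot products are cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Ignore each image's similarity to itself.
    np.fill_diagonal(sim, 0.0)
    n_images = sim.shape[0]
    mean_sim = sim.sum(axis=1) / (n_images - 1)
    # The image with the lowest average similarity is the predicted "different" object.
    return int(np.argmin(mean_sim))


# Toy trial: rows 0 and 1 mimic two views of one object, row 2 a different object.
trial = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 0.1, 1.0],
])
choice = odd_one_out(trial)
print(choice)
```

Under this assumed readout, model accuracy is the fraction of trials on which the returned index matches the image showing the different object, which can then be compared with human accuracy on the same image sets.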
