Evaluating Multiview Object Consistency in Humans and Image Models

Abstract

We introduce a benchmark to directly evaluate the alignment between human observers and vision models on a 3D shape inference task. We leverage an experimental design from the cognitive sciences that requires zero-shot visual inferences about object shape: given a set of images, participants identify which contain the same/different objects, despite considerable viewpoint variation. We draw from a diverse range of images that include common objects (e.g., chairs) as well as abstract shapes (i.e., procedurally generated 'nonsense' objects). After constructing over 2000 unique image sets, we administer these tasks to human participants, collecting 35K trials of behavioral data from over 500 participants. This includes explicit choice behaviors as well as intermediate measures, such as reaction time and gaze data. We then evaluate the performance of common vision models (e.g., DINOv2, MAE, CLIP). We find that humans outperform all models by a wide margin. Using a multi-scale evaluation approach, we identify underlying similarities and differences between models and humans: while human-model performance is correlated, humans allocate more time/processing on challenging trials. All images, data, and code can be accessed via our project page.
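The abstract does not say how a model's choice is read out on a same/different trial. As a minimal sketch, assuming each image in a trial has already been mapped to a feature vector by some vision model, one plausible zero-shot readout is to pick the image whose average cosine similarity to the other images is lowest. The function names `cosine` and `odd_one_out` and the toy vectors are hypothetical, not taken from the paper or its code.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def odd_one_out(features: Sequence[Sequence[float]]) -> int:
    """Return the index of the image least similar, on average, to the rest.

    Assumption: `features` holds one embedding per image in a single trial,
    and exactly one image shows a different object.
    """
    n = len(features)
    mean_similarity = []
    for i in range(n):
        others = [cosine(features[i], features[j]) for j in range(n) if j != i]
        mean_similarity.append(sum(others) / len(others))
    # The image with the lowest mean similarity is the model's "different" choice.
    return min(range(n), key=lambda i: mean_similarity[i])


# Toy trial: the first two vectors point in nearly the same direction
# (same object, two viewpoints); the third is the different object.
trial = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
]
choice = odd_one_out(trial)
```

Scoring the choice against the known different-object index on each trial would give a per-trial accuracy that can be compared with human choices on the same image sets.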
