IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests

Abstract

Although large Vision-Language Models (VLMs) have demonstrated remarkable performance across a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue matter more than the accuracy of the final prediction. **Our benchmark is vision-centric, minimizing dependence on unnecessary textual content**, which encourages models to derive answers primarily from image-based information rather than from learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work, which focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns they use to solve each problem, alongside final-prediction accuracy and human evaluation. Our experiments show substantial performance disparities between tasks, with `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in the general reasoning abilities of current VLMs. In terms of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieved the top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the models' reasoning processes and their final answers, emphasizing the importance of evaluating the correctness of the reasoning in addition to the final predictions.
