

IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests

May 17, 2025
Authors: Tan-Hanh Pham, Phu-Vinh Nguyen, Dang The Hung, Bui Trong Duong, Vu Nguyen Thanh, Chris Ngo, Tri Quang Truong, Truong-Son Hy
cs.AI

Abstract

Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. **Our benchmark is visually centric, minimizing the dependence on unnecessary textual content**, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, with models such as `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in current VLMs' general reasoning abilities. In terms of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieved top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.
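To make the abstract's two-signal evaluation concrete (answer accuracy plus a separate reasoning score judged from the model's explanation), here is a minimal scoring sketch. The record layout, field names, and example rows are hypothetical illustrations for exposition, not the authors' released data format or official scoring code.

```python
# Minimal sketch of scoring VLM outputs on two axes: final-answer accuracy and a
# reasoning score for the model's explanation (e.g., from human raters or an LLM judge).
# All field names and example records below are assumptions, not IQBench's actual schema.
from collections import defaultdict
from statistics import mean

records = [
    # one entry per (model, question)
    {"model": "o4-mini", "task": "3d_spatial", "predicted": "B", "gold": "B", "reasoning_score": 0.8},
    {"model": "o4-mini", "task": "anagram", "predicted": "C", "gold": "D", "reasoning_score": 0.3},
    {"model": "gemini-2.5-flash", "task": "3d_spatial", "predicted": "A", "gold": "B", "reasoning_score": 0.5},
]

def summarize(records):
    """Return per-model average answer accuracy and mean reasoning score."""
    by_model = defaultdict(list)
    for r in records:
        by_model[r["model"]].append(r)
    summary = {}
    for model, rows in by_model.items():
        accuracy = mean(1.0 if r["predicted"] == r["gold"] else 0.0 for r in rows)
        reasoning = mean(r["reasoning_score"] for r in rows)
        summary[model] = {"accuracy": round(accuracy, 3), "reasoning": round(reasoning, 3)}
    return summary

if __name__ == "__main__":
    for model, scores in summarize(records).items():
        print(model, scores)
```

Keeping the two scores separate is what lets the paper surface cases where a model reaches the right answer with flawed reasoning, or reasons sensibly but answers incorrectly.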

