IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests
May 17, 2025
Authors: Tan-Hanh Pham, Phu-Vinh Nguyen, Dang The Hung, Bui Trong Duong, Vu Nguyen Thanh, Chris Ngo, Tri Quang Truong, Truong-Son Hy
cs.AI
Abstract
Although large Vision-Language Models (VLMs) have demonstrated remarkable
performance in a wide range of multimodal tasks, their true reasoning
capabilities on human IQ tests remain underexplored. To advance research on the
fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed
to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the
reasoning capabilities of VLMs, which we argue are more important than the
accuracy of the final prediction. **Our benchmark is visually centric,
minimizing the dependence on unnecessary textual content**, thus encouraging
models to derive answers primarily from image-based information rather than
learned textual knowledge. To this end, we manually collected and annotated 500
visual IQ questions to **prevent unintentional data leakage during training**.
Unlike prior work that focuses primarily on the accuracy of the final answer,
we evaluate the reasoning ability of the models by assessing their explanations
and the patterns used to solve each problem, along with the accuracy of the
final prediction and human evaluation. Our experiments show that there are
substantial performance disparities between tasks, with models such as
`o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest
average accuracies of 0.615, 0.578, and 0.548, respectively. However, all
models struggle with 3D spatial and anagram reasoning tasks, highlighting
significant limitations in current VLMs' general reasoning abilities. In terms
of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet`
achieved top averages of 0.696, 0.586, and 0.516, respectively. These results
highlight inconsistencies between the reasoning processes of the models and
their final answers, emphasizing the importance of evaluating the accuracy of
the reasoning in addition to the final predictions.
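To make the dual-metric evaluation concrete, below is a minimal sketch of how per-task results could be aggregated, assuming each record carries a binary final-answer correctness flag and a graded reasoning score; the field names, grading scale, and example values are hypothetical illustrations, not taken from the paper or its released code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical record layout: one entry per (model, question), with a
# binary final-answer correctness flag and a 0-1 reasoning score assigned
# by a grader (LLM judge or human). Field names are illustrative only.
results = [
    {"model": "o4-mini", "task": "3d_spatial", "answer_correct": 1, "reasoning_score": 0.8},
    {"model": "o4-mini", "task": "anagram", "answer_correct": 0, "reasoning_score": 0.4},
    {"model": "gemini-2.5-flash", "task": "3d_spatial", "answer_correct": 0, "reasoning_score": 0.5},
]

def aggregate(records):
    """Average final-answer accuracy and reasoning score per (model, task)."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["model"], r["task"])].append(r)
    return {
        key: {
            "accuracy": mean(r["answer_correct"] for r in rows),
            "reasoning": mean(r["reasoning_score"] for r in rows),
        }
        for key, rows in grouped.items()
    }

for (model, task), scores in aggregate(results).items():
    print(f"{model:20s} {task:12s} acc={scores['accuracy']:.3f} reasoning={scores['reasoning']:.3f}")
```

Reporting the two averages separately, as in this sketch, is what exposes the inconsistencies the abstract describes between a model's reasoning process and its final answer.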