IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests

Abstract

Although large Vision-Language Models (VLMs) have demonstrated remarkable performance across a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce **IQBench**, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. **Our benchmark is visually centric, minimizing the dependence on unnecessary textual content**, thus encouraging models to derive answers primarily from image-based information rather than from learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to **prevent unintentional data leakage during training**. Unlike prior work, which focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns they use to solve each problem, alongside the accuracy of the final prediction and human evaluation. Our experiments show substantial performance disparities between tasks, with `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in the general reasoning abilities of current VLMs. In terms of reasoning scores, `o4-mini`, `gemini-2.5-flash`, and `claude-3.7-sonnet` achieved the top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.
