VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Abstract

Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is logical reasoning questions manually translated from the Chinese Civil Service Examination. Experiments show that, compared to benchmarks like MMMU, VisualPuzzles requires significantly less domain-specific knowledge and more complex reasoning, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also find that models exhibit different reasoning and answering patterns on VisualPuzzles than on benchmarks with a heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.
