We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in the reasoning process of LMMs. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively mitigated through knowledge augmentation strategies. More notably, the primary challenge for GPT-4o has shifted significantly from IK to IG, establishing it as the first LMM to advance towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts yet fail to answer the sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
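The four-way categorization can be illustrated with a minimal sketch. The abstract spells out only the Rote Memorization case (composite problem solved, sub-problems failed); the rules for the other three categories below are an assumed reading of the category names, and `classify_outcome` is a hypothetical helper, not part of the released evaluation code.

```python
from collections import Counter
from typing import Sequence


def classify_outcome(sub_correct: Sequence[bool], composite_correct: bool) -> str:
    """Assign one composite problem to IK, IG, CM, or RM (assumed rules)."""
    all_subs_correct = all(sub_correct)
    if composite_correct:
        # Composite solved: mastery if every sub-problem is also solved,
        # otherwise the composite answer looks memorized rather than reasoned.
        return "CM" if all_subs_correct else "RM"
    # Composite failed: either every required concept is known but they are
    # not combined correctly (IG), or at least one concept is missing (IK).
    return "IG" if all_subs_correct else "IK"


# Hypothetical results: (per-sub-problem correctness, composite correctness).
results = [
    ([True, True], True),
    ([True, False], True),
    ([True, True], False),
    ([False, False], False),
]
counts = Counter(classify_outcome(subs, comp) for subs, comp in results)
print(dict(counts))
```

Aggregating the per-problem labels over a benchmark gives the share of problems in each category, which is the kind of breakdown the abstract refers to when it says that a model's primary challenge shifts from IK to IG.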
