
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

July 1, 2024
Authors: Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma GongQue, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, Honggang Zhang
cs.AI

Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on the result-oriented performance but neglect the underlying principles in knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore the problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process. With WE-MATH, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between solving steps and problem-specific performance. We confirm the IK issue of LMMs can be effectively improved via knowledge augmentation strategies. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization - they correctly solve composite problems involving multiple knowledge concepts yet fail to answer sub-problems. We anticipate that WE-MATH will open new pathways for advancements in visual mathematical reasoning for LMMs. The WE-MATH data and evaluation code are available at https://github.com/We-Math/We-Math.
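The four-dimensional metric follows from comparing a model's result on a composite problem with its results on that problem's decomposed sub-problems. The decision rule below is a minimal sketch inferred from the abstract's definitions (the function name and inputs are illustrative, not the benchmark's actual evaluation code):

```python
def classify_outcome(sub_correct: list[bool], composite_correct: bool) -> str:
    """Assign one of WE-MATH's four categories to a single composite problem.

    sub_correct:       per-sub-problem correctness for the required knowledge concepts
    composite_correct: correctness on the original composite problem
    """
    all_subs = all(sub_correct)
    if composite_correct and all_subs:
        return "CM"  # Complete Mastery: composite and all sub-problems solved
    if composite_correct and not all_subs:
        return "RM"  # Rote Memorization: composite solved, yet sub-problems fail
    if not composite_correct and all_subs:
        return "IG"  # Inadequate Generalization: concepts mastered, composition fails
    return "IK"      # Insufficient Knowledge: sub-problem (concept) errors present

# Example: a model answers a two-concept composite correctly
# but misses one of its sub-problems -> flagged as Rote Memorization.
print(classify_outcome([True, False], True))
```

Under this reading, GPT-4o's shift from IK to IG means its sub-problem accuracy is high while composition across concepts remains the bottleneck, whereas RM-leaning models show the opposite pattern.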
