Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
July 11, 2024
Authors: Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, Kaizhu Huang
cs.AI
Abstract
Exceptional mathematical reasoning ability is one of the key features that
demonstrate the power of large language models (LLMs). How to comprehensively
define and evaluate the mathematical abilities of LLMs, and even reflect the
user experience in real-world scenarios, has emerged as a critical issue.
Current benchmarks predominantly concentrate on problem-solving capabilities,
which presents a substantial risk of model overfitting and fails to accurately
represent genuine mathematical reasoning abilities. In this paper, we argue
that if a model really understands a problem, it should be able to apply that
understanding robustly and readily across a diverse array of tasks. Motivated
by this, we introduce
MATHCHECK, a well-designed checklist for testing task generalization and
reasoning robustness, as well as an automatic tool to generate checklists
efficiently. MATHCHECK includes multiple mathematical reasoning tasks and
robustness test types to facilitate comprehensive evaluation of mathematical
reasoning ability alongside behavior testing. Utilizing MATHCHECK, we
develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual
reasoning and multi-modal reasoning capabilities, respectively, serving as
upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K.
We adopt MATHCHECK-GSM and MATHCHECK-GEO to evaluate over 20 LLMs and 11 MLLMs,
assessing their comprehensive mathematical reasoning abilities. Our results
demonstrate that while frontier LLMs like GPT-4o continue to excel in various
abilities on the checklist, many other model families exhibit a significant
decline. Further experiments indicate that, compared to traditional math
benchmarks, MATHCHECK better reflects true mathematical abilities and
represents mathematical intelligence more linearly, thereby supporting our
design. Using MATHCHECK, we can easily conduct detailed behavior analysis to
investigate models in depth.
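To make the checklist idea concrete (a matrix of reasoning tasks crossed with robustness variants of each seed problem), the minimal evaluation sketch below shows one way such a harness could be organized. Note that the task and variant labels, `CheckItem`, `evaluate_checklist`, `model_fn`, and `is_correct` are illustrative assumptions for this sketch, not the paper's actual taxonomy or tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Placeholder labels; the real MATHCHECK task/robustness taxonomy may differ.
TASKS = ["problem_solving", "answerable_judging", "outcome_judging", "process_judging"]
VARIANTS = ["original", "rephrased", "irrelevant_disturbance", "scenario_variation"]


@dataclass
class CheckItem:
    """One checklist cell: a task applied to one robustness variant of a seed problem."""
    seed_id: str
    task: str
    variant: str
    prompt: str
    reference: str  # gold answer or judgement used for automatic scoring


def evaluate_checklist(
    items: List[CheckItem],
    model_fn: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Run a model over every checklist cell and report per-(task, variant) accuracy."""
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for item in items:
        key = f"{item.task}/{item.variant}"
        totals[key] = totals.get(key, 0) + 1
        if is_correct(model_fn(item.prompt), item.reference):
            hits[key] = hits.get(key, 0) + 1
    return {key: hits.get(key, 0) / count for key, count in totals.items()}
```

In practice, `model_fn` would wrap a call to the (M)LLM under test and `is_correct` an answer extractor/comparator; the per-cell accuracies can then be aggregated into a task-by-robustness table of the kind a checklist evaluation reports.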