Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Abstract

Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and how to make that evaluation reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks concentrate predominantly on problem-solving capabilities, which presents a substantial risk of model overfitting and fails to accurately represent genuine mathematical reasoning abilities. In this paper, we argue that if a model really understands a problem, it should be able to apply that understanding robustly and readily across a diverse array of tasks. Motivated by this, we introduce MATHCHECK, a well-designed checklist for testing task generalization and reasoning robustness, as well as an automatic tool to generate checklists efficiently. MATHCHECK includes multiple mathematical reasoning tasks and robustness test types to facilitate a comprehensive evaluation of both mathematical reasoning ability and behavior testing. Utilizing MATHCHECK, we develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual reasoning and multi-modal reasoning capabilities, respectively, serving as upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K. We adopt MATHCHECK-GSM and MATHCHECK-GEO to evaluate the comprehensive mathematical reasoning abilities of over 20 LLMs and 11 MLLMs. Our results demonstrate that while frontier LLMs like GPT-4o continue to excel in various abilities on the checklist, many other model families exhibit a significant decline. Further experiments indicate that, compared to traditional math benchmarks, MATHCHECK better reflects true mathematical abilities and represents mathematical intelligence more linearly, thereby supporting our design. MATHCHECK also makes it easy to conduct detailed behavior analysis and investigate models in depth.
