Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
February 12, 2025
Authors: Safal Shrestha, Minwu Kim, Keith Ross
cs.AI
Abstract
Mathematical reasoning in Large Language Models (LLMs) is often evaluated
using benchmarks with limited numerical ranges, failing to reflect real-world
problem-solving across diverse scales. Furthermore, most existing evaluation
methods only compare model outputs to ground-truth answers, obscuring insights
into reasoning processes. To address these limitations, we introduce
GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs
numerical values in math problems to assess model robustness across varying
numerical scales. Additionally, we propose a novel grading methodology that
distinguishes between logical and non-logical errors, offering a more precise
evaluation of reasoning processes beyond computational accuracy. Our
experiments with various models reveal a significant increase in logical error
rates (up to 14 percentage points) as numerical complexity rises, demonstrating
a general weakness in reasoning with out-of-distribution numerical values.
Moreover, while models demonstrate high accuracy on standalone arithmetic
tasks, their performance deteriorates substantially when computations are
embedded within word problems. These findings provide a comprehensive
evaluation of LLMs' mathematical reasoning capabilities and inform future
research directions for improving numerical generalization in language models.
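
The perturbation idea behind GSM-Ranges can be sketched simply: keep a GSM8K
problem's wording and reasoning structure fixed, resample only its operands
from progressively larger magnitude tiers, and recompute the reference answer
so the perturbed problem stays consistent. The Python sketch below is a
minimal illustration under assumed conventions; the tier bounds, the template
format, and every name (TIERS, perturb) are illustrative assumptions, not the
paper's actual implementation.

    import random
    from typing import Callable, Tuple

    # Illustrative magnitude tiers for perturbation; the paper's actual
    # range definitions may differ.
    TIERS = {
        "level_1": (2, 100),        # roughly GSM8K-scale operands
        "level_2": (10**3, 10**4),  # moderately larger values
        "level_3": (10**6, 10**7),  # far outside the training distribution
    }

    def perturb(template: str,
                solution: Callable[..., int],
                n_values: int,
                tier: str,
                rng: random.Random) -> Tuple[str, int]:
        # Sample fresh operands from the chosen magnitude tier, fill them
        # into the problem template, and recompute the reference answer so
        # the perturbed problem stays internally consistent.
        lo, hi = TIERS[tier]
        values = [rng.randint(lo, hi) for _ in range(n_values)]
        return template.format(*values), solution(*values)

    rng = random.Random(0)
    template = ("Natalia sold {0} clips in April and {1} clips in May. "
                "How many clips did she sell altogether?")
    problem, answer = perturb(template, lambda a, b: a + b, 2, "level_3", rng)
    print(problem)  # same reasoning structure, much larger operands
    print(answer)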
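
The logical versus non-logical distinction can likewise be sketched: if
faithfully re-executing the operations the model itself wrote down recovers
the correct answer, the mistake was purely computational (non-logical); if
not, the reasoning plan itself was flawed (logical). The heuristic below is an
assumption for illustration only; it does not reproduce the paper's grading
pipeline, and the name classify_error is hypothetical.

    import re

    def classify_error(steps: list[str], reference: int) -> str:
        # Re-execute the model's stated arithmetic with exact math. Each
        # step is assumed to be a bare binary expression like "12 * 3"; a
        # real grader would need far more robust parsing and would
        # propagate corrected intermediates through later steps.
        result = None
        for step in steps:
            m = re.fullmatch(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*", step)
            if m is None:
                return "logical"  # not a well-formed operation
            a, op, b = int(m[1]), m[2], int(m[3])
            result = {"+": a + b, "-": a - b,
                      "*": a * b, "/": a // b}[op]
        return "non_logical" if result == reference else "logical"

    # A model that planned correctly (7 + 5 = 12, then 12 * 3) but
    # reported the product as 35 is graded non-logical: its own steps,
    # computed exactly, still reach the reference answer of 36.
    print(classify_error(["7 + 5", "12 * 3"], reference=36))  # non_logical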