Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
February 12, 2025
Authors: Safal Shrestha, Minwu Kim, Keith Ross
cs.AI
Abstract
Mathematical reasoning in Large Language Models (LLMs) is often evaluated
using benchmarks with limited numerical ranges, failing to reflect real-world
problem-solving across diverse scales. Furthermore, most existing evaluation
methods only compare model outputs to ground-truth answers, obscuring insights
into reasoning processes. To address these limitations, we introduce
GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs
numerical values in math problems to assess model robustness across varying
numerical scales. Additionally, we propose a novel grading methodology that
distinguishes between logical and non-logical errors, offering a more precise
evaluation of reasoning processes beyond computational accuracy. Our
experiments with various models reveal a significant increase in logical error
rates (up to 14 percentage points) as numerical complexity rises, demonstrating
a general weakness in reasoning with out-of-distribution numerical values.
Moreover, while models demonstrate high accuracy on standalone arithmetic
tasks, their performance deteriorates substantially when computations are
embedded within word problems. These findings provide a comprehensive
evaluation of LLMs' mathematical reasoning capabilities and inform future
research directions for improving numerical generalization in language models.
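
The perturbation idea behind GSM-Ranges can be sketched simply: keep a GSM8K
problem's wording and reasoning structure fixed, resample only its operands
from progressively larger magnitude tiers, and recompute the reference answer
so the perturbed problem stays consistent. The Python sketch below is a
minimal illustration under assumed conventions; the tier bounds, the template
format, and every name (TIERS, perturb) are illustrative assumptions, not the
paper's actual implementation.

    import random
    from typing import Callable, Tuple

    # Illustrative magnitude tiers for perturbation; the paper's actual
    # range definitions may differ.
    TIERS = {
        "level_1": (2, 100),        # roughly GSM8K-scale operands
        "level_2": (10**3, 10**4),  # moderately larger values
        "level_3": (10**6, 10**7),  # far outside the training distribution
    }

    def perturb(template: str,
                solution: Callable[..., int],
                n_values: int,
                tier: str,
                rng: random.Random) -> Tuple[str, int]:
        # Sample fresh operands from the chosen magnitude tier, fill them
        # into the problem template, and recompute the reference answer so
        # the perturbed problem stays internally consistent.
        lo, hi = TIERS[tier]
        values = [rng.randint(lo, hi) for _ in range(n_values)]
        return template.format(*values), solution(*values)

    rng = random.Random(0)
    template = ("Natalia sold {0} clips in April and {1} clips in May. "
                "How many clips did she sell altogether?")
    problem, answer = perturb(template, lambda a, b: a + b, 2, "level_3", rng)
    print(problem)  # same reasoning structure, much larger operands
    print(answer)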
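
The logical versus non-logical distinction can likewise be sketched: if
faithfully re-executing the operations the model itself wrote down recovers
the correct answer, the mistake was purely computational (non-logical); if
not, the reasoning plan itself was flawed (logical). The heuristic below is an
assumption for illustration only; it does not reproduce the paper's grading
pipeline, and the name classify_error is hypothetical.

    import re

    def classify_error(steps: list[str], reference: int) -> str:
        # Re-execute the model's stated arithmetic with exact math. Each
        # step is assumed to be a bare binary expression like "12 * 3"; a
        # real grader would need far more robust parsing and would
        # propagate corrected intermediates through later steps.
        result = None
        for step in steps:
            m = re.fullmatch(r"\s*(\d+)\s*([+\-*/])\s*(\d+)\s*", step)
            if m is None:
                return "logical"  # not a well-formed operation
            a, op, b = int(m[1]), m[2], int(m[3])
            result = {"+": a + b, "-": a - b,
                      "*": a * b, "/": a // b}[op]
        return "non_logical" if result == reference else "logical"

    # A model that planned correctly (7 + 5 = 12, then 12 * 3) but
    # reported the product as 35 is graded non-logical: its own steps,
    # computed exactly, still reach the reference answer of 36.
    print(classify_error(["7 + 5", "12 * 3"], reference=36))  # non_logical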