Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges

Abstract

Mathematical reasoning in Large Language Models (LLMs) is often evaluated using benchmarks with limited numerical ranges, which fail to reflect real-world problem-solving across diverse scales. Furthermore, most existing evaluation methods only compare model outputs to ground-truth answers, obscuring insights into reasoning processes. To address these limitations, we introduce GSM-Ranges, a dataset generator derived from GSM8K that systematically perturbs numerical values in math problems to assess model robustness across varying numerical scales. Additionally, we propose a novel grading methodology that distinguishes between logical and non-logical errors, offering a more precise evaluation of reasoning processes beyond computational accuracy. Our experiments with various models reveal a significant increase in logical error rates, up to 14 percentage points, as numerical complexity rises, demonstrating a general weakness in reasoning with out-of-distribution numerical values. Moreover, while models demonstrate high accuracy on standalone arithmetic tasks, their performance deteriorates substantially when computations are embedded within word problems. These findings provide a comprehensive evaluation of LLMs' mathematical reasoning capabilities and inform future research directions for improving numerical generalization in language models.
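
The two ideas in the abstract, perturbing the numbers in a word problem and separating logical from non-logical errors, can be illustrated with a minimal Python sketch. It is not the GSM-Ranges implementation: the level names, the numeric ranges, and the functions `perturb_numbers` and `classify` are hypothetical and chosen only for illustration. The sketch also assumes that a separate step (not shown) re-executes the model's stated reasoning with exact arithmetic to obtain `recomputed_answer`, and that the ground truth is recomputed for each perturbed problem.

```python
import random
import re

# Hypothetical perturbation levels (inclusive ranges). These are illustrative
# assumptions, not the ranges used by GSM-Ranges.
LEVELS = {
    "small": (1, 99),
    "medium": (1_000, 99_999),
    "large": (1_000_000, 99_999_999),
}


def perturb_numbers(problem: str, level: str, seed: int = 0) -> str:
    """Replace every integer in the problem text with a random integer
    drawn from the chosen level's range. Handles plain integers only."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    low, high = LEVELS[level]
    return re.sub(r"\d+", lambda _m: str(rng.randint(low, high)), problem)


def classify(model_answer, recomputed_answer, ground_truth) -> str:
    """Label a response.

    model_answer:      the final answer the model stated
    recomputed_answer: the result of re-running the model's reasoning steps
                       with exact arithmetic (assumed to come from elsewhere)
    ground_truth:      the correct answer for the perturbed problem
    """
    if model_answer == ground_truth:
        return "correct"
    if recomputed_answer == ground_truth:
        # The reasoning was sound; only the computation went wrong.
        return "non_logical_error"
    # Even with exact arithmetic the reasoning yields a wrong answer.
    return "logical_error"


if __name__ == "__main__":
    original = "Tom has 3 apples and buys 5 more. How many apples does he have?"
    print(perturb_numbers(original, "large", seed=42))
    print(classify(model_answer=10, recomputed_answer=12, ground_truth=12))
```

Replacing numbers independently can produce problems that no longer make sense (for example, negative intermediate results), so a practical generator would also need validity checks on each perturbed problem; the sketch omits these.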
