GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several state-of-the-art (SOTA) open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline occurs because current LLMs cannot perform genuine logical reasoning; instead, they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause does not contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
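The idea of a symbolic template can be illustrated with a minimal sketch. The template text, the names, the value ranges, and the distractor clause below are all hypothetical and are not taken from the GSM-Symbolic benchmark itself; the sketch only shows how one template could yield many instantiations of the same question, each with a known answer, and how a clause that does not affect the answer could be added.

```python
import random

# Hypothetical template in the style of a grade-school word problem.
# The placeholders are the "symbolic" part: names and numbers vary,
# while the underlying reasoning chain (x + y - z) stays the same.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples. "
    "How many apples does {name} have left?"
)

# Hypothetical pool of names (an assumption for illustration).
NAMES = ["Sophie", "Liam", "Mia", "Noah"]

# Hypothetical clause that sounds relevant but does not change the answer.
DISTRACTOR = "Five of the apples picked on Monday are a bit smaller than average. "


def instantiate(rng, with_distractor=False):
    """Return one (question, answer) pair sampled from the template."""
    name = rng.choice(NAMES)
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    # Constraint on the sampled values: the answer must not be negative.
    z = rng.randint(1, x + y)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    if with_distractor:
        # Insert the irrelevant clause just before the final question sentence.
        head, sep, tail = question.rpartition("How many")
        question = head + DISTRACTOR + sep + tail
    answer = x + y - z
    return question, answer


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        question, answer = instantiate(rng)
        print(question, "->", answer)
```

Evaluating a model on many such instantiations, rather than on a single fixed question, is what makes it possible to measure the variance across instantiations that the abstract describes.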
