
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

October 7, 2024
Authors: Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar
cs.AI

Abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause does not contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
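
To make the template idea concrete, here is a minimal sketch of how a single GSM8K-style question could be turned into a symbolic template and re-instantiated with different names and numbers, along with an optional clause that looks relevant but leaves the answer unchanged (in the spirit of the single-clause experiment described above). The template wording, placeholder names, and sampling ranges are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical symbolic template for one GSM8K-style question. The
# wording, placeholders, and sampling ranges are illustrative
# assumptions, not the paper's actual template format.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{distractor}How many apples does {name} have in total?"
)

# A clause in the spirit of the paper's "seemingly relevant" additions:
# it mentions a quantity but does not change the final answer.
DISTRACTOR = "{z} of Tuesday's apples are slightly smaller than average. "

def instantiate(seed: int, with_distractor: bool = False) -> tuple[str, int]:
    """Sample one concrete instance of the template and its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Ava", "Noah"])
    x, y = rng.randint(5, 40), rng.randint(5, 40)
    distractor = DISTRACTOR.format(z=rng.randint(1, y)) if with_distractor else ""
    question = TEMPLATE.format(name=name, x=x, y=y, distractor=distractor)
    return question, x + y  # the correct answer follows from the template's structure

# Many instantiations of one underlying problem: a model that truly
# reasons should solve every variant, with or without the distractor.
for seed in range(3):
    print(instantiate(seed, with_distractor=(seed == 2)))
```

Evaluating a model across many such instantiations, rather than on one fixed phrasing, is what lets the benchmark separate memorized solution patterns from reasoning that is robust to surface changes.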
