GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Abstract

Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several state-of-the-art open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline occurs because current LLMs cannot perform genuine logical reasoning; instead, they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause does not contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
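To make the idea of a symbolic template concrete, the following is a minimal sketch of how one template could yield many instantiations of the same question, and how a seemingly relevant but inconsequential clause could be appended. The template text, the name list, the value ranges, and the helper names (`instantiate`, `add_distractor`, `make_dataset`) are all assumptions invented for illustration; they are not taken from the GSM-Symbolic benchmark itself.

```python
import random

# Hypothetical template, invented for illustration only.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} gives away {z} apples. How many apples does {name} have left?"
)
NAMES = ["Ava", "Liam", "Sofia", "Noah"]  # assumed name pool


def instantiate(rng):
    """Sample one (question, answer) pair from the template."""
    name = rng.choice(NAMES)
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    # Constraint: cannot give away more apples than were picked.
    z = rng.randint(1, x + y)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z
    return question, answer


def add_distractor(question, rng):
    """Insert a clause that looks relevant but leaves the answer unchanged."""
    k = rng.randint(2, 9)
    clause = f" {k} of the apples are slightly smaller than average."
    # Place the clause just before the final "How many ..." sentence.
    head, sep, tail = question.rpartition(" How many")
    return head + clause + sep + tail


def make_dataset(n, seed=0):
    """Generate n instantiations reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [instantiate(rng) for _ in range(n)]
```

Under these assumptions, evaluating a model on several datasets produced with different seeds would expose the variance across instantiations that the abstract describes, while comparing accuracy on a question with and without `add_distractor` applied would isolate the effect of an irrelevant clause.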
