Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Abstract

Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and how to reflect the user experience in real-world scenarios, has emerged as a critical issue. Current benchmarks concentrate predominantly on problem-solving capabilities, which carries a substantial risk of model overfitting and fails to accurately represent genuine mathematical reasoning ability. In this paper, we argue that if a model really understands a problem, that understanding should apply robustly and readily across a diverse array of tasks. Motivated by this, we introduce MATHCHECK, a well-designed checklist for testing task generalization and reasoning robustness, together with an automatic tool for generating checklists efficiently. MATHCHECK includes multiple mathematical reasoning tasks and robustness test types to support a comprehensive evaluation of both mathematical reasoning ability and model behavior. Using MATHCHECK, we develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual reasoning and multi-modal reasoning capabilities, respectively; these serve as upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K. We use MATHCHECK-GSM and MATHCHECK-GEO to evaluate the comprehensive mathematical reasoning abilities of over 20 LLMs and 11 MLLMs. Our results show that while frontier LLMs such as GPT-4o continue to excel across the abilities on the checklist, many other model families exhibit a significant decline. Further experiments indicate that, compared to traditional math benchmarks, MATHCHECK better reflects true mathematical abilities and represents mathematical intelligence more linearly, which supports our design. With MATHCHECK, detailed behavior analysis can be conducted easily to investigate models in depth.
