MathScale: Scaling Instruction Tuning for Mathematical Reasoning

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct MwpBench, a benchmark of Math Word Problems that collects ten datasets (including GSM8K and MATH) covering K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy.
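The three steps named in the abstract (concept extraction, concept-graph construction, question generation) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how edges are defined or how concepts are sampled, so the co-occurrence edge weights, the weighted random walk, the prompt wording, and every name below (`SEED_CONCEPTS`, `build_concept_graph`, `sample_concepts`, `make_prompt`) are assumptions made for illustration. The extraction step, which would call an LLM, is replaced by hard-coded concept lists.

```python
from collections import Counter
from itertools import combinations
import random

# Hypothetical extraction output: in the described pipeline an LLM would
# produce these topic / knowledge-point lists from seed questions.
SEED_CONCEPTS = [
    ["arithmetic", "addition of fractions", "common denominators"],
    ["arithmetic", "common denominators", "least common multiple"],
    ["geometry", "area of a triangle", "Pythagorean theorem"],
]


def build_concept_graph(concept_lists):
    """Count how often two concepts appear in the same seed question.

    Assumption: edge weight = co-occurrence count.
    """
    edges = Counter()
    for concepts in concept_lists:
        for a, b in combinations(sorted(set(concepts)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(edges, node):
    """Return {neighbor: weight} for every concept linked to `node`."""
    out = {}
    for (a, b), weight in edges.items():
        if a == node:
            out[b] = weight
        elif b == node:
            out[a] = weight
    return out


def sample_concepts(edges, start, steps, rng):
    """Weighted random walk over the graph (an assumed sampling strategy)."""
    path = [start]
    current = start
    for _ in range(steps):
        nbrs = neighbors(edges, current)
        if not nbrs:
            break
        names = sorted(nbrs)
        weights = [nbrs[name] for name in names]
        current = rng.choices(names, weights=weights, k=1)[0]
        if current not in path:
            path.append(current)
    return path


def make_prompt(concepts):
    """Format a generation prompt; the wording is illustrative only."""
    return ("Write a new math question and its solution that uses: "
            + "; ".join(concepts))


graph = build_concept_graph(SEED_CONCEPTS)
picked = sample_concepts(graph, "arithmetic", steps=2, rng=random.Random(0))
print(make_prompt(picked))
```

In this sketch the printed prompt would be sent to a frontier LLM to obtain a new question-answer pair. Because the walk only follows existing edges, sampled concepts always come from seed questions that share at least one concept with the starting node.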

