MathScale: Scaling Instruction Tuning for Mathematical Reasoning
March 5, 2024
Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in
problem-solving. However, their proficiency in solving mathematical problems
remains inadequate. We propose MathScale, a simple and scalable method to
create high-quality mathematical reasoning data using frontier LLMs (e.g.,
GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning,
it first extracts topics and knowledge points from seed math questions and then
builds a concept graph, which is subsequently used to generate new math
questions. MathScale exhibits effective scalability along the size axis of the
math dataset that we generate. As a result, we create a mathematical reasoning
dataset (MathScaleQA) containing two million math question-answer pairs. To
evaluate mathematical reasoning abilities of LLMs comprehensively, we construct
MwpBench, a benchmark of Math Word Problems, which is a collection of ten
datasets (including GSM8K and MATH) covering K-12, college, and competition
level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g.,
LLaMA-2 and Mistral), resulting in significantly improved capabilities in
mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves
state-of-the-art performance across all datasets, surpassing its best peers of
equivalent size by 42.9% in micro average accuracy and 43.7% in macro average
accuracy, respectively.
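The abstract's pipeline (extract topics and knowledge points from seed questions, build a concept graph, then traverse it to seed new questions) can be illustrated with a minimal sketch. Note this is an assumption-laden toy: the seed annotations, the co-occurrence edges, and the random-walk sampler are all illustrative stand-ins; in the paper, the extraction and question generation are themselves performed by an LLM such as GPT-3.5.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical seed data: each seed question annotated with extracted
# topics and knowledge points (in MathScale this extraction is done by an LLM).
seeds = [
    {"topics": ["algebra"], "kps": ["linear equations", "substitution"]},
    {"topics": ["algebra", "geometry"], "kps": ["slope", "linear equations"]},
    {"topics": ["geometry"], "kps": ["slope", "distance formula"]},
]

# Build an undirected co-occurrence graph over concepts: nodes are topics and
# knowledge points; an edge links two concepts that appear in the same seed.
graph = defaultdict(set)
for seed in seeds:
    concepts = seed["topics"] + seed["kps"]
    for a, b in itertools.combinations(concepts, 2):
        graph[a].add(b)
        graph[b].add(a)

def sample_concepts(start, length=3, rng=random):
    """Random-walk a few connected concepts to seed a new question prompt,
    preferring unvisited neighbors so the sampled concepts stay distinct."""
    walk = [start]
    while len(walk) < length and graph[walk[-1]]:
        unvisited = sorted(graph[walk[-1]] - set(walk))
        walk.append(rng.choice(unvisited or sorted(graph[walk[-1]])))
    return walk

random.seed(0)
concepts = sample_concepts("algebra")
# The sampled concept combination becomes the prompt for generating
# a brand-new math question (here just formatted as a string).
prompt = f"Write a new math question combining: {', '.join(concepts)}."
print(prompt)
```

Sampling combinations from the graph, rather than paraphrasing seed questions directly, is what lets the dataset scale: the number of connected concept subsets grows far faster than the number of seeds.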