MathScale: Scaling Instruction Tuning for Mathematical Reasoning

March 5, 2024
Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in problem solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism of human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct MwpBench, a benchmark of math word problems comprising ten datasets (including GSM8K and MATH) that cover K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy.
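To make the three-stage pipeline described above concrete, here is a minimal Python sketch: concept extraction from seed questions, concept-graph construction, and graph-guided generation of new questions. Everything below is illustrative rather than the paper's actual implementation: call_llm is a hypothetical stand-in for a frontier-LLM client (e.g., GPT-3.5), and the prompt wording and reply-parsing format are assumptions.

```python
import random
from collections import defaultdict

def call_llm(prompt: str) -> str:
    """Hypothetical frontier-LLM client (e.g., GPT-3.5); replace with a real API call."""
    raise NotImplementedError

def extract_concepts(seed_question: str) -> tuple[list[str], list[str]]:
    """Stage 1: ask the LLM which topics and knowledge points a seed question exercises."""
    reply = call_llm(
        "List the math topics and knowledge points needed to solve:\n" + seed_question
    )
    # Assumed reply format (two lines): "topics: a, b" / "knowledge points: c, d".
    topics_line, kps_line = reply.splitlines()[:2]
    topics = [t.strip() for t in topics_line.split(":", 1)[1].split(",")]
    kps = [k.strip() for k in kps_line.split(":", 1)[1].split(",")]
    return topics, kps

def build_concept_graph(seed_questions: list[str]) -> dict[str, set[str]]:
    """Stage 2: connect concepts that co-occur in the same seed question."""
    graph: dict[str, set[str]] = defaultdict(set)
    for q in seed_questions:
        topics, kps = extract_concepts(q)
        concepts = topics + kps
        for a in concepts:
            for b in concepts:
                if a != b:
                    graph[a].add(b)
    return graph

def generate_question(graph: dict[str, set[str]], n_concepts: int = 3) -> str:
    """Stage 3: random-walk a small concept set, then prompt the LLM for a new QA pair."""
    node = random.choice(list(graph))
    picked = [node]
    for _ in range(20):  # bounded walk so small graphs cannot loop forever
        if len(picked) >= n_concepts or not graph[node]:
            break
        node = random.choice(sorted(graph[node]))
        if node not in picked:
            picked.append(node)
    return call_llm(
        "Write a new math word problem, with a fully worked answer, that combines: "
        + ", ".join(picked)
    )
```

Because edges come from co-occurrence in human-written seed questions, a walk over the graph tends to combine knowledge points that naturally belong together, which is what keeps the generated questions topically coherent while still scaling well beyond the seed set.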
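On the two reported metrics: micro average accuracy pools all test examples across MwpBench's ten datasets before computing accuracy, so larger datasets dominate, while macro average accuracy is the unweighted mean of per-dataset accuracies, so every dataset counts equally. A small illustration (the dictionary layout is mine, not the benchmark's API):

```python
def mwpbench_averages(results: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """results maps each dataset name to (num_correct, num_total)."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    micro = correct / total  # every example counts equally; large datasets dominate
    macro = sum(c / t for c, t in results.values()) / len(results)  # every dataset counts equally
    return micro, macro

# E.g., 900/1319 on one dataset and 1700/5000 on another gives
# micro ~ 0.41 but macro ~ 0.51, since macro ignores dataset size.
print(mwpbench_averages({"GSM8K": (900, 1319), "MATH": (1700, 5000)}))
```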