MathScale: Scaling Instruction Tuning for Mathematical Reasoning
March 5, 2024
Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in
problem-solving. However, their proficiency in solving mathematical problems
remains inadequate. We propose MathScale, a simple and scalable method to
create high-quality mathematical reasoning data using frontier LLMs (e.g.,
GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning,
it first extracts topics and knowledge points from seed math questions and then
builds a concept graph, which is subsequently used to generate new math
questions. MathScale exhibits effective scalability along the size axis of the
math dataset that we generate. As a result, we create a mathematical reasoning
dataset (MathScaleQA) containing two million math question-answer pairs. To
evaluate mathematical reasoning abilities of LLMs comprehensively, we construct
MwpBench, a benchmark of Math Word Problems, which is a collection of ten
datasets (including GSM8K and MATH) covering K-12, college, and competition
level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g.,
LLaMA-2 and Mistral), resulting in significantly improved capabilities in
mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves
state-of-the-art performance across all datasets, surpassing its best peers of
equivalent size by 42.9% in micro average accuracy and 43.7% in macro average
accuracy, respectively.
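The abstract's pipeline (extract topics and knowledge points from seed questions, build a concept graph, then traverse it to seed new questions) can be illustrated with a minimal sketch. Note this is an assumption-laden toy: the seed annotations, the co-occurrence edges, and the random-walk sampler are all illustrative stand-ins; in the paper, the extraction and question generation are themselves performed by an LLM such as GPT-3.5.

```python
import itertools
import random
from collections import defaultdict

# Hypothetical seed data: each seed question annotated with extracted
# topics and knowledge points (in MathScale this extraction is done by an LLM).
seeds = [
    {"topics": ["algebra"], "kps": ["linear equations", "substitution"]},
    {"topics": ["algebra", "geometry"], "kps": ["slope", "linear equations"]},
    {"topics": ["geometry"], "kps": ["slope", "distance formula"]},
]

# Build an undirected co-occurrence graph over concepts: nodes are topics and
# knowledge points; an edge links two concepts that appear in the same seed.
graph = defaultdict(set)
for seed in seeds:
    concepts = seed["topics"] + seed["kps"]
    for a, b in itertools.combinations(concepts, 2):
        graph[a].add(b)
        graph[b].add(a)

def sample_concepts(start, length=3, rng=random):
    """Random-walk a few connected concepts to seed a new question prompt,
    preferring unvisited neighbors so the sampled concepts stay distinct."""
    walk = [start]
    while len(walk) < length and graph[walk[-1]]:
        unvisited = sorted(graph[walk[-1]] - set(walk))
        walk.append(rng.choice(unvisited or sorted(graph[walk[-1]])))
    return walk

random.seed(0)
concepts = sample_concepts("algebra")
# The sampled concept combination becomes the prompt for generating
# a brand-new math question (here just formatted as a string).
prompt = f"Write a new math question combining: {', '.join(concepts)}."
print(prompt)
```

Sampling combinations from the graph, rather than paraphrasing seed questions directly, is what lets the dataset scale: the number of connected concept subsets grows far faster than the number of seeds.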