MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
September 21, 2023
Authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu
cs.AI
Abstract
Large language models (LLMs) have pushed the limits of natural language
understanding and exhibited excellent problem-solving ability. Despite this
great success, most existing open-source LLMs (e.g., LLaMA-2) remain far
from satisfactory at solving mathematical problems due to the complex
reasoning procedures involved. To bridge this gap, we propose MetaMath, a
fine-tuned language model that specializes in mathematical reasoning.
Specifically, we start by bootstrapping mathematical questions, rewriting
each question from multiple perspectives without extra knowledge, which
results in a new dataset called MetaMathQA. We then fine-tune the LLaMA-2
models on MetaMathQA. Experimental results on two popular benchmarks for
mathematical reasoning (i.e., GSM8K and MATH) demonstrate that MetaMath
outperforms a suite of open-source LLMs by a significant margin. Our
MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the
state-of-the-art models of the same size by 11.5% and 8.7%. In particular,
MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than
GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models at
different model sizes, and the training code for public use.
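The "bootstrapping from multiple perspectives" step can be pictured as expanding each seed question into several rewriting prompts. The sketch below is a minimal illustration, not the authors' implementation: the perspective names and prompt wordings are hypothetical, and a real pipeline would send each prompt to an LLM and add the rewritten question (with a regenerated answer) to the fine-tuning set.

```python
# Hypothetical sketch of MetaMathQA-style question bootstrapping.
# Prompt templates and perspective names are illustrative assumptions.
PERSPECTIVES = {
    "rephrase": (
        "Rephrase this question without changing its answer:\n{q}"
    ),
    "self_verification": (
        "Rewrite this question as a declarative statement, then ask "
        "whether a proposed answer is consistent with it:\n{q}"
    ),
    "backward": (
        "Mask one known quantity in this question with X, state the "
        "original answer, and ask for the value of X:\n{q}"
    ),
}

def bootstrap_prompts(question: str) -> dict[str, str]:
    """Expand one seed question into several rewriting prompts.

    In a full pipeline, each prompt would be fed to an LLM; the
    rewritten questions become new training samples (no extra
    knowledge beyond the seed question is required).
    """
    return {name: tmpl.format(q=question)
            for name, tmpl in PERSPECTIVES.items()}

# Example with a GSM8K-style seed question:
prompts = bootstrap_prompts(
    "Natalia sold 48 clips in April and half as many in May. "
    "How many clips did she sell altogether?"
)
```

One seed question thus yields several training questions per pass, which is how a modest seed set can be grown into a large fine-tuning corpus.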