MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
September 21, 2023
Authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu
cs.AI
Abstract
Large language models (LLMs) have pushed the limits of natural language
understanding and exhibited excellent problem-solving ability. Despite this
great success, most existing open-source LLMs (e.g., LLaMA-2) still fall far
short on mathematical problems because of the complex reasoning procedures
involved. To bridge this gap, we propose MetaMath, a fine-tuned language
model that specializes in mathematical reasoning. Specifically, we start by
bootstrapping mathematical questions, rewriting each question from multiple
perspectives without extra knowledge, which yields a new dataset called
MetaMathQA. We then fine-tune the LLaMA-2 models on MetaMathQA.
Experimental results on two popular benchmarks for mathematical reasoning
(i.e., GSM8K and MATH) demonstrate that MetaMath outperforms a suite of
open-source LLMs by a significant margin. Our MetaMath-7B model achieves
66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models
of the same size by 11.5% and 8.7%, respectively. In particular, MetaMath-70B
achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo.
We release the MetaMathQA dataset, the MetaMath models at different model
sizes, and the training code for public use.
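To make the "rewriting the question from multiple perspectives" idea concrete, here is a minimal, hypothetical sketch of one such rewrite: masking a known quantity in a forward question and asking for it given the final answer, a backward-reasoning style of augmentation. The question text, the `rewrite_backward` helper, and the masking scheme are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: one way a forward math question can be
# "bootstrapped" into a backward variant without extra knowledge.
# The helper name and question are hypothetical, not from the paper.

def rewrite_backward(question: str, masked_value: str, answer: str) -> str:
    """Mask one known number as X and ask for X given the original answer."""
    assert masked_value in question, "value to mask must appear in the question"
    backward = question.replace(masked_value, "X", 1)  # mask first occurrence
    return f"{backward} The answer is {answer}. What is the value of X?"

forward_q = ("James buys 5 packs of beef that are 4 pounds each. "
             "How many pounds did he buy?")
backward_q = rewrite_backward(forward_q, "5", "20")
print(backward_q)
# The rewritten question now requires solving for the masked quantity X.
```

Pairing each such rewritten question with a chain-of-thought answer is what grows the original training set into a larger augmented one like MetaMathQA.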