MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
September 21, 2023
Authors: Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu
cs.AI
Abstract
Large language models (LLMs) have pushed the limits of natural language
understanding and exhibited excellent problem-solving ability. Despite this
great success, most existing open-source LLMs (e.g., LLaMA-2) still fall far
short on mathematical problems because of the complex reasoning procedures
involved. To bridge this gap, we propose MetaMath, a fine-tuned language
model that specializes in mathematical reasoning. Specifically, we start by
bootstrapping mathematical questions, rewriting each question from multiple
perspectives without extra knowledge, which yields a new dataset called
MetaMathQA. We then fine-tune the LLaMA-2 models on MetaMathQA.
Experimental results on two popular benchmarks for mathematical reasoning
(i.e., GSM8K and MATH) demonstrate that MetaMath outperforms a suite of
open-source LLMs by a significant margin. Our MetaMath-7B model achieves
66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models
of the same size by 11.5% and 8.7%, respectively. In particular, MetaMath-70B
achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo.
We release the MetaMathQA dataset, the MetaMath models at different model
sizes, and the training code for public use.
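To make the "rewriting the question from multiple perspectives" idea concrete, here is a minimal, hypothetical sketch of one such rewrite: masking a known quantity in a forward question and asking for it given the final answer, a backward-reasoning style of augmentation. The question text, the `rewrite_backward` helper, and the masking scheme are illustrative assumptions, not the paper's actual pipeline.

```python
# Illustrative sketch only: one way a forward math question can be
# "bootstrapped" into a backward variant without extra knowledge.
# The helper name and question are hypothetical, not from the paper.

def rewrite_backward(question: str, masked_value: str, answer: str) -> str:
    """Mask one known number as X and ask for X given the original answer."""
    assert masked_value in question, "value to mask must appear in the question"
    backward = question.replace(masked_value, "X", 1)  # mask first occurrence
    return f"{backward} The answer is {answer}. What is the value of X?"

forward_q = ("James buys 5 packs of beef that are 4 pounds each. "
             "How many pounds did he buy?")
backward_q = rewrite_backward(forward_q, "5", "20")
print(backward_q)
# The rewritten question now requires solving for the masked quantity X.
```

Pairing each such rewritten question with a chain-of-thought answer is what grows the original training set into a larger augmented one like MetaMathQA.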