

MathScale: Scaling Instruction Tuning for Mathematical Reasoning

March 5, 2024
Authors: Zhengyang Tang, Xingxing Zhang, Benyou Wang, Furu Wei
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism of human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct MwpBench, a benchmark of math word problems comprising ten datasets (including GSM8K and MATH) that cover K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved mathematical reasoning capabilities. Evaluated on MwpBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy.
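
The abstract sketches the pipeline only at a high level: extract topics and knowledge points from seed questions, connect them in a concept graph, then sample from that graph to prompt an LLM for new question-answer pairs. The Python sketch below is a minimal illustration of that idea under our own assumptions; the graph construction, the random-walk sampling, and every function and variable name here are hypothetical and do not reflect the paper's actual implementation.

```python
import random
from collections import defaultdict

# Hypothetical seed extractions: in MathScale, (topic, knowledge-point) pairs
# are extracted from seed questions by a frontier LLM (e.g., GPT-3.5);
# here they are hard-coded purely for illustration.
seed_extractions = [
    ("linear equations", ["isolating a variable", "distributive property"]),
    ("linear equations", ["slope-intercept form"]),
    ("percentages", ["percent change", "isolating a variable"]),
]

def build_concept_graph(extractions):
    """Link each topic to its knowledge points, and co-occurring knowledge
    points to each other, as a simple undirected adjacency map."""
    graph = defaultdict(set)
    for topic, kps in extractions:
        for kp in kps:
            graph[topic].add(kp)
            graph[kp].add(topic)
        for a in kps:
            for b in kps:
                if a != b:
                    graph[a].add(b)
    return graph

def sample_concepts(graph, topic, num_kps=2, max_steps=100, seed=None):
    """Random-walk from a topic to gather related knowledge points
    that a newly generated question should exercise."""
    rng = random.Random(seed)
    kps, node = [], topic
    for _ in range(max_steps):
        if len(kps) >= num_kps:
            break
        node = rng.choice(sorted(graph[node]))  # step to a random neighbor
        if node != topic and node not in kps:
            kps.append(node)
    return kps

graph = build_concept_graph(seed_extractions)
kps = sample_concepts(graph, "linear equations", seed=0)
prompt = (f"Write a new math word problem about linear equations "
          f"that requires: {', '.join(kps)}. Then solve it step by step.")
print(prompt)  # this prompt would be sent to an LLM to produce a QA pair
```

At scale, each sampled prompt would be sent to a frontier LLM and the returned question-answer pair collected into the training set; repeating this over many sampled concept combinations is, presumably, how a two-million-pair dataset like MathScaleQA accumulates.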