MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
October 10, 2024
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Code has been shown to be effective in enhancing the mathematical reasoning
abilities of large language models due to its precision and accuracy. Previous
works involving continued mathematical pretraining often include code that
utilizes math-related packages, which are primarily designed for fields such as
engineering, machine learning, signal processing, or module testing, rather
than being directly focused on mathematical reasoning. In this paper, we
introduce a novel method for generating mathematical code accompanied by
corresponding reasoning steps for continued pretraining. Our approach begins
with the construction of a high-quality mathematical continued pretraining
dataset by incorporating math-related web data, code using mathematical
packages, math textbooks, and synthetic data. Next, we construct reasoning
steps by extracting LaTeX expressions, the conditions needed for the
expressions, and the results of the expressions from the previously collected
dataset. Based on this extracted information, we generate corresponding code to
accurately capture the mathematical reasoning process. Appending the generated
code to each reasoning step results in data consisting of paired natural
language reasoning steps and their corresponding code. Combining this data with
the original dataset results in a 19.2B-token high-performing mathematical
pretraining corpus, which we name MathCode-Pile. Training several popular base
models with this corpus significantly improves their mathematical abilities,
leading to the creation of the MathCoder2 family of models. All of our data
processing and training code is open-sourced, ensuring full transparency and
easy reproducibility of the entire data collection and training pipeline. The
code is released at https://github.com/mathllm/MathCoder2.
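As a concrete, hypothetical illustration of the pairing described above, the sketch below shows how a reasoning step's extracted conditions, LaTeX expression, and result might be rendered as executable code. The example step, the variable names, the use of sympy, and the final assertion are assumptions made for illustration, not the paper's exact generation format.

```python
from sympy import sqrt  # sympy keeps the arithmetic exact

# Natural-language reasoning step (hypothetical example):
#   "Given legs a = 3 and b = 4, the hypotenuse is c = \sqrt{a^2 + b^2} = 5."

# Conditions extracted from the step
a = 3
b = 4

# The LaTeX expression \sqrt{a^2 + b^2} translated into code
c = sqrt(a**2 + b**2)

# The result extracted from the step; re-running the expression lets us
# check that the code reproduces the mathematical reasoning (assumed check)
expected = 5
assert c == expected, f"computed {c}, expected {expected}"
print(c)  # -> 5
```

Appending a snippet like this after the corresponding natural-language step yields one text-code pair of the kind that, combined with the original dataset, forms MathCode-Pile.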