MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
October 10, 2024
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Code has been shown to be effective in enhancing the mathematical reasoning
abilities of large language models due to its precision and accuracy. Previous
works involving continued mathematical pretraining often include code that
utilizes math-related packages, which are primarily designed for fields such as
engineering, machine learning, signal processing, or module testing, rather
than being directly focused on mathematical reasoning. In this paper, we
introduce a novel method for generating mathematical code accompanied by
corresponding reasoning steps for continued pretraining. Our approach begins
with the construction of a high-quality mathematical continued pretraining
dataset by incorporating math-related web data, code using mathematical
packages, math textbooks, and synthetic data. Next, we construct reasoning
steps by extracting LaTeX expressions, the conditions needed for the
expressions, and the results of the expressions from the previously collected
dataset. Based on this extracted information, we generate corresponding code to
accurately capture the mathematical reasoning process. Appending the generated
code to each reasoning step results in data consisting of paired natural
language reasoning steps and their corresponding code. Combining this data with
the original dataset results in a 19.2B-token high-performing mathematical
pretraining corpus, which we name MathCode-Pile. Training several popular base
models with this corpus significantly improves their mathematical abilities,
leading to the creation of the MathCoder2 family of models. All of our data
processing and training code is open-sourced, ensuring full transparency and
easy reproducibility of the entire data collection and training pipeline. The
code is released at https://github.com/mathllm/MathCoder2.
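As a concrete, hypothetical illustration of the pairing described above, the sketch below shows how a reasoning step's extracted conditions, LaTeX expression, and result might be rendered as executable code. The example step, the variable names, the use of sympy, and the final assertion are assumptions made for illustration, not the paper's exact generation format.

```python
from sympy import sqrt  # sympy keeps the arithmetic exact

# Natural-language reasoning step (hypothetical example):
#   "Given legs a = 3 and b = 4, the hypotenuse is c = \sqrt{a^2 + b^2} = 5."

# Conditions extracted from the step
a = 3
b = 4

# The LaTeX expression \sqrt{a^2 + b^2} translated into code
c = sqrt(a**2 + b**2)

# The result extracted from the step; re-running the expression lets us
# check that the code reproduces the mathematical reasoning (assumed check)
expected = 5
assert c == expected, f"computed {c}, expected {expected}"
print(c)  # -> 5
```

Appending a snippet like this after the corresponding natural-language step yields one text-code pair of the kind that, combined with the original dataset, forms MathCode-Pile.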