MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
October 10, 2024
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Code has been shown to be effective in enhancing the mathematical reasoning
abilities of large language models due to its precision and accuracy. Previous
works involving continued mathematical pretraining often include code that
utilizes math-related packages, which are primarily designed for fields such as
engineering, machine learning, signal processing, or module testing, rather
than being directly focused on mathematical reasoning. In this paper, we
introduce a novel method for generating mathematical code accompanied by
corresponding reasoning steps for continued pretraining. Our approach begins
with the construction of a high-quality mathematical continued pretraining
dataset by incorporating math-related web data, code using mathematical
packages, math textbooks, and synthetic data. Next, we construct reasoning
steps by extracting LaTeX expressions, the conditions needed for the
expressions, and the results of the expressions from the previously collected
dataset. Based on this extracted information, we generate corresponding code to
accurately capture the mathematical reasoning process. Appending the generated
code to each reasoning step results in data consisting of paired natural
language reasoning steps and their corresponding code. Combining this data with
the original dataset results in a 19.2B-token high-performing mathematical
pretraining corpus, which we name MathCode-Pile. Training several popular base
models with this corpus significantly improves their mathematical abilities,
leading to the creation of the MathCoder2 family of models. All of our data
processing and training code is open-sourced, ensuring full transparency and
easy reproducibility of the entire data collection and training pipeline. The
code is released at https://github.com/mathllm/MathCoder2.
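To make the paired data format the abstract describes more concrete, below is a minimal, hypothetical sketch of one such example: a natural-language reasoning step containing a LaTeX expression, the conditions and result extracted from it, and a short Python snippet that re-computes and checks the expression. The reasoning step, variable names, and values here are invented for illustration and are not drawn from MathCode-Pile; in the actual pipeline the extraction and the appended code are model-generated.

```python
import math

# A (hypothetical) natural-language reasoning step containing a LaTeX expression.
reasoning_step = (
    "Given a right triangle with legs $a = 3$ and $b = 4$, the hypotenuse is "
    r"$c = \sqrt{a^2 + b^2} = 5$."
)

# Conditions extracted from the step.
a = 3
b = 4

# Code translating the LaTeX expression c = \sqrt{a^2 + b^2}.
c = math.sqrt(a**2 + b**2)

# The result stated in the step, used to check the computation.
expected = 5
assert math.isclose(c, expected), f"computed {c}, expected {expected}"

print(reasoning_step)
print(f"c = {c}")  # c = 5.0
```

Appending a snippet like this after its reasoning step yields one paired natural-language/code training example of the kind that, combined with the original sources, forms the MathCode-Pile corpus.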