MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code
October 10, 2024
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
cs.AI
Abstract
Code has been shown to be effective in enhancing the mathematical reasoning
abilities of large language models due to its precision and accuracy. Previous
works involving continued mathematical pretraining often include code that
utilizes math-related packages, which are primarily designed for fields such as
engineering, machine learning, signal processing, or module testing, rather
than being directly focused on mathematical reasoning. In this paper, we
introduce a novel method for generating mathematical code accompanied by
corresponding reasoning steps for continued pretraining. Our approach begins
with the construction of a high-quality mathematical continued pretraining
dataset by incorporating math-related web data, code using mathematical
packages, math textbooks, and synthetic data. Next, we construct reasoning
steps by extracting LaTeX expressions, the conditions needed for the
expressions, and the results of the expressions from the previously collected
dataset. Based on this extracted information, we generate corresponding code to
accurately capture the mathematical reasoning process. Appending the generated
code to each reasoning step results in data consisting of paired natural
language reasoning steps and their corresponding code. Combining this data with
the original dataset results in a 19.2B-token high-performing mathematical
pretraining corpus, which we name MathCode-Pile. Training several popular base
models with this corpus significantly improves their mathematical abilities,
leading to the creation of the MathCoder2 family of models. All of our data
processing and training code is open-sourced, ensuring full transparency and
easy reproducibility of the entire data collection and training pipeline. The
code is released at https://github.com/mathllm/MathCoder2.
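To make the paired data format the abstract describes more concrete, below is a minimal, hypothetical sketch of one such example: a natural-language reasoning step containing a LaTeX expression, the conditions and result extracted from it, and a short Python snippet that re-computes and checks the expression. The reasoning step, variable names, and values here are invented for illustration and are not drawn from MathCode-Pile; in the actual pipeline the extraction and the appended code are model-generated.

```python
import math

# A (hypothetical) natural-language reasoning step containing a LaTeX expression.
reasoning_step = (
    "Given a right triangle with legs $a = 3$ and $b = 4$, the hypotenuse is "
    r"$c = \sqrt{a^2 + b^2} = 5$."
)

# Conditions extracted from the step.
a = 3
b = 4

# Code translating the LaTeX expression c = \sqrt{a^2 + b^2}.
c = math.sqrt(a**2 + b**2)

# The result stated in the step, used to check the computation.
expected = 5
assert math.isclose(c, expected), f"computed {c}, expected {expected}"

print(reasoning_step)
print(f"c = {c}")  # c = 5.0
```

Appending a snippet like this after its reasoning step yields one paired natural-language/code training example of the kind that, combined with the original sources, forms the MathCode-Pile corpus.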