MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

Abstract

Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous work on continued mathematical pretraining often includes code that uses math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied by corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset that incorporates math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step yields data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset produces a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at https://github.com/mathllm/MathCoder2.
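To make the pairing of reasoning steps and code more concrete, here is a minimal Python sketch of what one paired sample might look like. The `ReasoningStep` schema, the `render_pair` helper, the output layout, and the example sum are all hypothetical illustrations and are not taken from the paper or its repository; only the four ingredients (LaTeX expression, conditions, result, generated code) come from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningStep:
    """One extracted reasoning step (hypothetical schema, for illustration only)."""
    expression: str        # LaTeX expression found in the source text
    conditions: List[str]  # conditions needed for the expression
    result: str            # stated result of the expression
    code: str              # code generated to mirror the reasoning step


def render_pair(step: ReasoningStep) -> str:
    """Append the generated code to the natural-language reasoning step."""
    conditions = "\n".join(f"- {c}" for c in step.conditions)
    return (
        f"Conditions:\n{conditions}\n"
        f"Expression: ${step.expression}$\n"
        f"Result: {step.result}\n"
        f"Code:\n{step.code}"
    )


# Made-up example: the closed form for the sum of the first n integers.
step = ReasoningStep(
    expression=r"\sum_{k=1}^{n} k = \frac{n(n+1)}{2}",
    conditions=["n is a positive integer", "n = 10"],
    result="55",
    code="n = 10\ntotal = n * (n + 1) // 2\nprint(total)",
)

paired_text = render_pair(step)
print(paired_text)
```

Under these assumptions, each training sample is a single text in which the natural language step is immediately followed by the code that reproduces it, so the model sees both forms of the same reasoning side by side.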
