MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code

Abstract

Code has been shown to be effective in enhancing the mathematical reasoning abilities of large language models due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that uses math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing rather than for mathematical reasoning itself. In this paper, we introduce a novel method for generating mathematical code accompanied by corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step yields data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset produces a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models on this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at https://github.com/mathllm/MathCoder2.
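
To make the pairing of reasoning steps and code more concrete, the following is a minimal Python sketch of what one paired sample might look like. It is an illustration under stated assumptions, not the MathCoder2 pipeline itself: the `ReasoningStep` class, its field names, the `pair_step_with_code` helper, and the output layout are all hypothetical, and the code string is hand-written here, whereas the abstract describes code that is generated from the extracted information.

```python
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    """One extracted reasoning step (hypothetical structure, not from the paper)."""
    text: str        # natural language reasoning step
    expression: str  # LaTeX expression extracted from the text
    conditions: str  # conditions needed for the expression
    result: str      # stated result of the expression
    code: str        # code for this step (hand-written in this sketch)


def pair_step_with_code(step: ReasoningStep) -> str:
    """Append the code to its reasoning step to form one text sample."""
    return (
        f"{step.text}\n"
        f"Conditions: {step.conditions}\n"
        f"Expression: {step.expression}\n"
        f"Result: {step.result}\n"
        f"\n"
        f"{step.code}"
    )


# Illustrative input; the values are made up for this example.
step = ReasoningStep(
    text="The area of a circle with radius 3 follows from the area formula.",
    expression=r"A = \pi r^2",
    conditions="r = 3",
    result="A = 9\\pi",
    code="import math\nr = 3\narea = math.pi * r ** 2\nprint(area)",
)

sample = pair_step_with_code(step)
print(sample)
```

Keeping the extracted expression, conditions, and result next to the code is one possible layout; the abstract only states that the generated code is appended to each reasoning step.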
