AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts
February 12, 2024
Authors: Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
cs.AI
Abstract
To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or classifiers trained on human-annotated data, our approach uses meta-prompted language models as zero-shot verifiers to autonomously evaluate and select high-quality mathematical content, and we release the curated open-source AutoMathText dataset, encompassing over 200GB of data. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter Mistral language model on the AutoMathText dataset, achieving substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual-pretraining work. Our method yields a twofold increase in pretraining token efficiency over baselines, underscoring its potential to enhance models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
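To make the "meta-prompted language models as zero-shot verifiers" idea concrete, the sketch below scores a candidate document by comparing a base model's next-token probabilities for "YES" versus "NO" after a quality-assessment prompt. This is a minimal illustrative sketch, not the authors' implementation: the prompt wording, the choice of scoring model, the `lm_quality_score` helper, and the YES/NO scoring rule are assumptions; see the linked repository for the actual method.

```python
# Hypothetical sketch of LM-based zero-shot data selection.
# The meta-prompt, model choice, and scoring rule are illustrative assumptions,
# not the exact setup used to build AutoMathText.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

META_PROMPT = (
    "You are an expert mathematician reviewing pretraining data. "
    "Does the following text contain high-quality mathematical content?\n\n"
    "Text:\n{text}\n\n"
    "Answer with YES or NO: "
)

def lm_quality_score(text: str) -> float:
    """Return P(YES) / (P(YES) + P(NO)) from the next-token logits."""
    prompt = META_PROMPT.format(text=text)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    # Use the first sub-token of "YES"/"NO" as an approximation.
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

if __name__ == "__main__":
    sample = "Let f(x) = x^2. Then f'(x) = 2x by the power rule."
    print(f"quality score: {lm_quality_score(sample):.3f}")
```

In such a pipeline, documents whose score exceeds a chosen threshold would be retained for continual pretraining, which is the selection step the abstract describes.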