AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts
February 12, 2024
Authors: Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao
cs.AI
Abstract
To improve language models' proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or classifiers trained on human-annotated data, our approach uses meta-prompted language models as zero-shot verifiers to autonomously evaluate and select high-quality mathematical content, and we release the curated open-source AutoMathText dataset, encompassing over 200GB of data. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter Mistral language model on the AutoMathText dataset, achieving substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual pretraining work. Our method yields a twofold increase in pretraining token efficiency over baselines, underscoring its potential for enhancing models' mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
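
The core idea above, using a meta-prompted base language model as a zero-shot verifier that scores raw text for mathematical quality, can be illustrated with a minimal sketch. The prompt wording, the placeholder model choice (Mistral-7B), the softmax-over-YES/NO-logits score, and the selection threshold below are illustrative assumptions, not the exact configuration released with AutoMathText.

```python
# Minimal sketch of zero-shot quality scoring with a meta-prompted base LM.
# Prompt wording, model choice, score definition, and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder base (non-chat) model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

META_PROMPT = (
    "You are an expert in mathematics. Decide whether the following text "
    "contains substantial mathematical content with educational value.\n\n"
    "<text>\n{text}\n</text>\n\n"
    "Answer with a single word, YES or NO.\nAnswer:"
)

# Ids of the answer tokens; assuming the answer is a single token is a
# simplification (tokenizers may split "YES"/"NO" differently).
YES_ID = tokenizer.encode(" YES", add_special_tokens=False)[0]
NO_ID = tokenizer.encode(" NO", add_special_tokens=False)[0]


@torch.no_grad()
def lm_score(text: str) -> float:
    """Return P(YES) vs P(NO) at the answer position as a score in [0, 1]."""
    prompt = META_PROMPT.format(text=text[:4000])  # truncate long documents
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    logits = model(**inputs).logits[0, -1]  # next-token logits
    yes_no = torch.softmax(logits[[YES_ID, NO_ID]], dim=-1)
    return yes_no[0].item()


# Usage: keep only documents whose score clears an (illustrative) threshold.
docs = [
    "The quadratic formula gives the roots of ax^2 + bx + c = 0 ...",
    "Top 10 celebrity diets you won't believe ...",
]
selected = [d for d in docs if lm_score(d) > 0.75]
```

In a full pipeline, such scores would be computed in batches over a large web and paper corpus and the highest-scoring documents retained for continual pretraining; readers who only need the filtered data can download it directly from the Hugging Face dataset page linked above.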