AutoMathText: Autonomous Data Selection with Language Models for Mathematical Texts

Abstract

To improve language models' mathematical reasoning through continual pretraining, we introduce a strategy that uses base language models for autonomous data selection. Unlike conventional approaches that rely on supervised fine-tuning or on classifiers trained with human-annotated data, our method uses meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. We also release the curated, open-source AutoMathText dataset, which contains over 200GB of data. To demonstrate the method's efficacy, we continually pretrained a 7B-parameter Mistral language model on AutoMathText and obtained substantial improvements in downstream performance on the MATH dataset while using orders of magnitude fewer tokens than previous continual pretraining work. Our method achieves a twofold increase in pretraining token efficiency over baselines, underscoring its potential for strengthening models' mathematical reasoning. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
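As a rough illustration of what using a language model as a zero-shot verifier can look like, the sketch below turns a model's preference between two answer tokens into a score and keeps documents above a cutoff. The abstract does not specify the scoring rule, so everything here is an assumption made for illustration: the two-way softmax over "YES"/"NO" logits, the names `yes_no_score` and `select_documents`, the `get_logits` callback (a stand-in for a real model call), and the threshold of 0.6.

```python
import math
from typing import Callable, Iterable, List, Tuple


def yes_no_score(logit_yes: float, logit_no: float) -> float:
    """Return the softmax probability of YES over the pair {YES, NO}.

    Assumption: the verifier's judgement is read from the logits of two
    answer tokens. The abstract does not state this rule.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def select_documents(
    docs: Iterable[str],
    get_logits: Callable[[str], Tuple[float, float]],
    threshold: float = 0.6,  # hypothetical cutoff, not taken from the source
) -> List[Tuple[str, float]]:
    """Keep (document, score) pairs whose score reaches the threshold.

    `get_logits` is a hypothetical callback that would prompt a base
    language model with a question about the document and return the
    logits of the "YES" and "NO" tokens.
    """
    kept = []
    for doc in docs:
        logit_yes, logit_no = get_logits(doc)
        score = yes_no_score(logit_yes, logit_no)
        if score >= threshold:
            kept.append((doc, score))
    return kept
```

In practice `get_logits` would wrap a forward pass of the base model on a meta-prompt that embeds the document. Because the score is continuous rather than a binary label, the cutoff can be tuned to trade dataset size against quality.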
