Training and Evaluating Language Models with Template-based Data Generation

Abstract

The rapid advancement of large language models (LLMs) such as GPT-3, PaLM, and Llama has transformed natural language processing, with these models showing remarkable capabilities in understanding and generating language. However, they often struggle with tasks that require complex reasoning, particularly mathematical problem-solving, in part because the large-scale, high-quality, domain-specific datasets needed to train sophisticated reasoning abilities are scarce. To address this limitation, we introduce Template-based Data Generation (TDG), a novel approach that uses an LLM (GPT-4) to automatically generate parameterized meta-templates, which are then used to synthesize a wide variety of high-quality problems and solutions. Using TDG, we create TemplateMath Part I: TemplateGSM, a dataset of over 7 million synthetically generated grade school math problems, each accompanied by a code-based solution and a natural language solution. Because each template is parameterized, the same templates can generate an effectively unlimited number of further problems. The dataset alleviates the scarcity of large-scale mathematical datasets and serves as a resource for pre-training, fine-tuning, and evaluating LLMs on mathematical reasoning. Beyond enabling virtually unlimited data generation, our method extends data augmentation by using GPT-4 to generate the meta-templates themselves, which helps ensure diverse and high-quality problem structures. The TemplateMath Part I: TemplateGSM dataset is publicly available at https://huggingface.co/datasets/math-ai/TemplateGSM. The code is available at https://github.com/iiis-ai/TemplateMath.
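To make the idea of a parameterized meta-template concrete, the following is a minimal sketch, not the authors' implementation. It assumes a template is a function that samples parameters, renders a problem statement, and emits both a code-based and a natural language solution. The function names (`generate_problem`, `verify`), the dictionary keys, and the example problem are hypothetical and chosen only for illustration.

```python
import random


def generate_problem(rng: random.Random) -> dict:
    """Hypothetical meta-template: sample parameters, then render one problem."""
    # Sampled parameters; different draws yield different problem instances.
    name = rng.choice(["Ava", "Ben", "Chen"])
    item = rng.choice(["apples", "pencils", "stickers"])
    count = rng.randint(2, 20)
    price = rng.randint(1, 9)
    total = count * price

    problem = (
        f"{name} buys {count} {item} at ${price} each. "
        f"How much does {name} spend in total?"
    )
    # Code-based solution, rendered as Python source text.
    solution_code = (
        "def solve():\n"
        f"    count = {count}\n"
        f"    price = {price}\n"
        "    return count * price\n"
    )
    # Natural language solution.
    solution_text = f"{name} spends {count} * {price} = {total} dollars."

    return {
        "problem": problem,
        "solution_code": solution_code,
        "solution_text": solution_text,
        "answer": total,
    }


def verify(sample: dict) -> bool:
    """Run the generated code solution and compare it with the stored answer."""
    namespace = {}
    exec(sample["solution_code"], namespace)
    return namespace["solve"]() == sample["answer"]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    for _ in range(3):
        sample = generate_problem(rng)
        assert verify(sample)
        print(sample["problem"])
```

Because the parameters are sampled independently, one such template can produce many distinct problems, and executing the rendered code gives a simple check that each stored answer matches its solution.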
