OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Abstract

Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and those of the best open-source LLMs. Building on the recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best GPT-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.
