OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Abstract

Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu et al., 2024) and MAmmoTH (Yue et al., 2024) are constructed using outputs from closed-source LLMs with commercially restrictive licenses. A key reason limiting the use of open-source LLMs in these data generation pipelines has been the wide gap between the mathematical skills of the best closed-source LLMs, such as GPT-4, and those of the best open-source LLMs. Building on recent progress in open-source LLMs, our proposed prompting novelty, and some brute-force scaling, we construct OpenMathInstruct-1, a math instruction tuning dataset with 1.8M problem-solution pairs. The dataset is constructed by synthesizing code-interpreter solutions for GSM8K and MATH, two popular math reasoning benchmarks, using the recently released and permissively licensed Mixtral model. Our best model, OpenMath-CodeLlama-70B, trained on a subset of OpenMathInstruct-1, achieves a score of 84.6% on GSM8K and 50.7% on MATH, which is competitive with the best GPT-distilled models. We release our code, models, and the OpenMathInstruct-1 dataset under a commercially permissive license.

