

Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

July 11, 2024
作者: Liang Zeng, Liangjun Zhong, Liang Zhao, Tianwen Wei, Liu Yang, Jujie He, Cheng Cheng, Rui Hu, Yang Liu, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI

Abstract

In this paper, we investigate the underlying factors that potentially enhance the mathematical reasoning capabilities of large language models (LLMs). We argue that the data scaling law for math reasoning capabilities in modern LLMs is far from saturated, highlighting how model quality improves as data quantity increases. To support this claim, we introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B LLMs using our proposed 2.5M-instance Skywork-MathQA dataset. Skywork-Math 7B achieves impressive accuracies of 51.2% on the competition-level MATH benchmark and 83.9% on the GSM8K benchmark using only SFT data, outperforming an early version of GPT-4 on MATH. The superior performance of Skywork-Math models is attributable to our novel two-stage data synthesis and model SFT pipelines, which include three different augmentation methods and a diverse seed problem set, ensuring both the quantity and quality of the Skywork-MathQA dataset across varying difficulty levels. Most importantly, we provide several practical takeaways for enhancing math reasoning abilities in LLMs, for both research and industry applications.
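To make the described two-stage data synthesis and SFT data preparation concrete, below is a minimal, illustrative sketch in Python. This is not the authors' implementation: the three augmentation helpers (`rephrase`, `add_difficulty`, `vary_numbers`), the seed examples, and the exact stage split are assumptions standing in for the paper's actual methods. It only shows the overall shape of the pipeline, producing synthetic problem/solution pairs in a JSONL format commonly used for supervised fine-tuning.

```python
import json
import random

# Hypothetical seed problems; the paper uses a diverse, much larger seed set.
SEED_PROBLEMS = [
    {"problem": "If 3x + 5 = 20, what is x?",
     "solution": "3x = 15, so x = 5."},
    {"problem": "A train travels 60 km in 1.5 hours. What is its average speed?",
     "solution": "Speed = distance / time = 60 / 1.5 = 40 km/h."},
]

# Three placeholder augmentation methods; the paper's actual augmentations
# (and their LLM-based generation) differ.
def rephrase(example):
    return {"problem": "Restated: " + example["problem"],
            "solution": example["solution"]}

def add_difficulty(example):
    return {"problem": example["problem"] + " Explain each step.",
            "solution": example["solution"]}

def vary_numbers(example):
    # In practice the solution would be recomputed for the new numbers.
    return {"problem": example["problem"].replace("60", str(random.choice([80, 90, 120]))),
            "solution": example["solution"]}

AUGMENTATIONS = [rephrase, add_difficulty, vary_numbers]

def synthesize(seeds, n_per_stage):
    """Two-stage synthesis: stage 1 augments seed problems,
    stage 2 augments the stage-1 outputs to grow quantity and diversity."""
    stage1 = [random.choice(AUGMENTATIONS)(random.choice(seeds)) for _ in range(n_per_stage)]
    stage2 = [random.choice(AUGMENTATIONS)(random.choice(stage1)) for _ in range(n_per_stage)]
    return stage1 + stage2

if __name__ == "__main__":
    data = synthesize(SEED_PROBLEMS, n_per_stage=4)
    # Write instruction/response pairs for a downstream SFT trainer.
    with open("skywork_mathqa_sketch.jsonl", "w") as f:
        for ex in data:
            f.write(json.dumps({"instruction": ex["problem"], "response": ex["solution"]}) + "\n")
    print(f"Wrote {len(data)} synthetic SFT examples")
```

In the actual pipeline, the resulting dataset (scaled to 2.5M instances) would then be fed to a standard SFT trainer over a 7B base model; the sketch above covers only the data-synthesis side.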
