Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On

July 11, 2024
Authors: Liang Zeng, Liangjun Zhong, Liang Zhao, Tianwen Wei, Liu Yang, Jujie He, Cheng Cheng, Rui Hu, Yang Liu, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI

Abstract

In this paper, we investigate the underlying factors that potentially enhance the mathematical reasoning capabilities of large language models (LLMs). We argue that the data scaling law for math reasoning capabilities in modern LLMs is far from saturated, highlighting how model quality improves with increases in data quantity. To support this claim, we introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B LLMs using our proposed 2.5M-instance Skywork-MathQA dataset. Skywork-Math 7B achieves impressive accuracies of 51.2% on the competition-level MATH benchmark and 83.9% on the GSM8K benchmark using only SFT data, outperforming an early version of GPT-4 on MATH. The superior performance of the Skywork-Math models is attributable to our novel two-stage data synthesis and model SFT pipelines, which include three different augmentation methods and a diverse seed problem set, ensuring both the quantity and quality of the Skywork-MathQA dataset across varying difficulty levels. Most importantly, we provide several practical takeaways for enhancing math reasoning abilities in LLMs for both research and industry applications.
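The abstract describes supervised fine-tuning (SFT) of common 7B base models on the Skywork-MathQA problem/solution data. As a rough illustration only, not the authors' actual pipeline, the sketch below shows what a minimal SFT run over such data could look like with Hugging Face Transformers; the base checkpoint name, the JSONL file path, the "problem"/"solution" field names, and all hyperparameters are placeholder assumptions.

```python
# Minimal SFT sketch, assuming a JSONL file of {"problem": ..., "solution": ...} records.
# Model name, data path, and hyperparameters are illustrative placeholders, not the
# paper's actual configuration.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder "common 7B" base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each record holds one math problem and its step-by-step solution.
raw = load_dataset("json", data_files="mathqa_sft_sample.jsonl")["train"]

def to_features(example):
    # Concatenate prompt and target so the causal LM learns to generate the solution.
    text = f"Problem: {example['problem']}\nSolution: {example['solution']}"
    enc = tokenizer(text, truncation=True, max_length=1024, padding="max_length")
    # Mask padding positions so they do not contribute to the loss.
    enc["labels"] = [tid if tid != tokenizer.pad_token_id else -100
                     for tid in enc["input_ids"]]
    return enc

train_ds = raw.map(to_features, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="math-sft-checkpoint",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
    logging_steps=50,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```

The paper's two-stage data synthesis (three augmentation methods applied to a diverse seed problem set) happens upstream of this step and is not shown here.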
