Common 7B Language Models Already Possess Strong Math Capabilities
March 7, 2024
Authors: Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, Houwen Peng
cs.AI
Abstract
Mathematical capabilities were previously believed to emerge in common
language models only at a very large scale or to require extensive math-related
pre-training. This paper shows that the LLaMA-2 7B model with common
pre-training already exhibits strong mathematical abilities, as evidenced by
its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks,
respectively, when selecting the best response from 256 random generations. The
primary issue with the current base model is the difficulty in consistently
eliciting its inherent mathematical capabilities. Notably, the accuracy for the
first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks,
respectively. We find that simply scaling up the SFT data can significantly
enhance the reliability of generating correct answers. However, the potential
for extensive scaling is constrained by the scarcity of publicly available math
questions. To overcome this limitation, we employ synthetic data, which proves
to be nearly as effective as real data and shows no clear saturation when
scaled up to approximately one million samples. This straightforward approach
achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B
models, surpassing previous models by 14.2% and 20.8%, respectively. We also
provide insights into scaling behaviors across different reasoning complexities
and error types.
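The 97.7% and 72.0% figures above come from a best-of-256 evaluation: a question counts as solved if any of 256 sampled generations produces the correct answer, whereas the 49.5% and 7.9% figures use only the first sampled answer. The sketch below is not from the paper; it is a minimal illustration of this best-of-N metric, assuming a hypothetical `sample_answer` function standing in for drawing one generation from the model and a toy dataset of (question, reference answer) pairs.

```python
import random

def sample_answer(question: str) -> str:
    """Hypothetical stand-in for drawing one random generation from the model."""
    return random.choice(["42", "7", "13"])  # placeholder responses

def best_of_n_accuracy(dataset, n: int = 256) -> float:
    """Fraction of questions where at least one of n sampled answers is correct."""
    solved = 0
    for question, reference in dataset:
        if any(sample_answer(question) == reference for _ in range(n)):
            solved += 1
    return solved / len(dataset)

# Toy comparison of best-of-256 accuracy vs. first-answer (n=1) accuracy.
toy_data = [("What is 6 * 7?", "42"), ("What is 3 + 4?", "7")]
print(best_of_n_accuracy(toy_data, n=256), best_of_n_accuracy(toy_data, n=1))
```

The gap between the n=256 and n=1 numbers is what the paper attributes to the base model's difficulty in consistently eliciting its inherent capability, and what scaling SFT data (including synthetic data) is shown to narrow.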