Common 7B Language Models Already Possess Strong Math Capabilities
March 7, 2024
Authors: Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, Houwen Peng
cs.AI
Abstract
Mathematical capabilities were previously believed to emerge in common
language models only at a very large scale or require extensive math-related
pre-training. This paper shows that the LLaMA-2 7B model with common
pre-training already exhibits strong mathematical abilities, as evidenced by
its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks,
respectively, when selecting the best response from 256 random generations. The
primary issue with the current base model is the difficulty in consistently
eliciting its inherent mathematical capabilities. Notably, the accuracy for the
first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks,
respectively. We find that simply scaling up the SFT data can significantly
enhance the reliability of generating correct answers. However, the potential
for extensive scaling is constrained by the scarcity of publicly available math
questions. To overcome this limitation, we employ synthetic data, which proves
to be nearly as effective as real data and shows no clear saturation when
scaled up to approximately one million samples. This straightforward approach
achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B
models, surpassing previous models by 14.2% and 20.8%, respectively. We also
provide insights into scaling behaviors across different reasoning complexities
and error types.
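
To make the "best response from 256 random generations" measurement concrete, below is a minimal sketch (not the authors' code) of a best-of-N evaluation: sample N responses per problem, count a problem as solved if any sampled answer matches the reference (the 97.7%/72.0% figures), and compare against the accuracy of the first sample alone (the 49.5%/7.9% figures). The `generate` and `extract_final_answer` helpers are hypothetical placeholders for a model sampler and an answer parser.

```python
from typing import Callable, Dict, List, Tuple

def best_of_n_accuracy(
    problems: List[Dict[str, str]],               # each: {"question": ..., "answer": ...}
    generate: Callable[[str], str],               # hypothetical: returns one sampled response
    extract_final_answer: Callable[[str], str],   # hypothetical: parses the final answer string
    n: int = 256,
) -> Tuple[float, float]:
    """Return (pass@1, best-of-N) accuracy over a list of problems."""
    first_correct = 0
    any_correct = 0
    for prob in problems:
        # Draw n independent samples for the same question.
        samples = [generate(prob["question"]) for _ in range(n)]
        answers = [extract_final_answer(s) for s in samples]
        # pass@1: only the first sampled answer counts.
        if answers and answers[0] == prob["answer"]:
            first_correct += 1
        # best-of-N: solved if any of the n sampled answers is correct.
        if prob["answer"] in answers:
            any_correct += 1
    total = max(len(problems), 1)
    return first_correct / total, any_correct / total
```

The gap between the two returned numbers is what the paper frames as the base model's inherent ability versus the difficulty of eliciting it consistently, which the SFT data scaling is then meant to close.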