

TinyGSM: achieving >80% on GSM8k with small language models

December 14, 2023
Authors: Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni, Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang
cs.AI

Abstract

Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving ability remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of 12.3M grade school math problems paired with Python solutions, generated fully by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B generation model and a 1.3B verifier model can achieve 81.5% accuracy, outperforming existing models that are orders of magnitude larger. This also rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our model's training data is generated. Our approach is simple and has two key components: 1) the high-quality dataset TinyGSM, and 2) the use of a verifier, which selects the final output from multiple candidate generations.
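
As a rough illustration of the two components named in the abstract, the Python sketch below shows (1) a TinyGSM-style item pairing a word problem with an executable Python solution, and (2) best-of-N answer selection with a verifier. The item schema, the `generator`/`verifier` objects, and the candidate count are hypothetical placeholders for illustration, not the paper's actual code or data format.

```python
# Illustrative sketch only; the schema and APIs below are assumptions,
# not the paper's released code.

# (1) A TinyGSM-style item: a grade school word problem paired with an
# executable Python solution, of the kind the abstract says GPT-3.5 generated.
def simple_math_problem() -> int:
    """
    Sally has 5 apples and then buys 3 bags with 4 apples each.
    How many apples does she have in total?
    """
    initial_apples = 5
    bought_apples = 3 * 4
    return initial_apples + bought_apples  # 17

# (2) Verifier-guided selection: sample several candidate solutions from the
# 1.3B generation model, score each with the 1.3B verifier, keep the best.
def solve(question: str, generator, verifier, n_candidates: int = 16) -> str:
    candidates = [generator.sample(question) for _ in range(n_candidates)]
    scores = [verifier.score(question, c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best]
```

In this setup the verifier trades extra inference-time compute (multiple sampled candidates) for accuracy, which is consistent with how the abstract describes a pair of 1.3B models rivaling their much larger GPT-3.5 teacher.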