

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

August 3, 2023
Authors: Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, Chang Zhou
cs.AI

Abstract

Mathematical reasoning is a challenging task for large language models (LLMs), yet its scaling relationship with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, the amount of supervised data, and the amount of augmented data influence the reasoning performance of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance; better models improve less when the supervised dataset is enlarged. To augment more data samples for improving model performance without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find that, when the augmented samples contain more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find that RFT brings more improvement for less performant LLMs. Furthermore, combining rejection samples from multiple models pushes LLaMA-7B to an accuracy of 49.3%, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
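
The RFT procedure described in the abstract amounts to a sample-and-filter loop: sample many reasoning paths from the SFT model, keep only those reaching the correct answer, and deduplicate. Below is a minimal sketch of that data-collection step in Python, assuming hypothetical helpers `sample_reasoning_paths` (the SFT model's sampler) and `extract_final_answer` (an answer parser); the paper's actual implementation details, such as sampling settings or how distinct paths are identified, may differ.

```python
# Hedged sketch of the RFT data-collection loop described in the abstract.
# `sample_reasoning_paths` and `extract_final_answer` are hypothetical helpers
# standing in for an SFT model's sampler and an answer parser.

def collect_rft_dataset(problems, sample_reasoning_paths, extract_final_answer, k=100):
    """For each (question, gold_answer) pair, sample k reasoning paths from the
    SFT model and keep only distinct paths whose final answer is correct."""
    augmented = []
    for question, gold_answer in problems:
        seen = set()
        for path in sample_reasoning_paths(question, num_samples=k):
            # Rejection step: discard paths that do not reach the correct answer.
            if extract_final_answer(path) != gold_answer:
                continue
            # Keep only distinct reasoning paths; the abstract notes that more
            # distinct paths yield larger RFT gains.
            if path in seen:
                continue
            seen.add(path)
            augmented.append({"question": question, "reasoning": path})
    return augmented
```

Combining rejection samples from multiple models, as mentioned at the end of the abstract, would correspond to merging the outputs of this loop run with several sampler models before deduplication.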