

Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

August 3, 2023
Authors: Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, Chang Zhou
cs.AI

Abstract

Mathematical reasoning is a challenging task for large language models (LLMs), yet its scaling relationship with respect to LLM capacity is under-explored. In this paper, we investigate how pre-training loss, the amount of supervised data, and the amount of augmented data influence the reasoning performance of a supervised LLM. We find that pre-training loss is a better indicator of a model's performance than its parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance; better models improve less as the supervised dataset is enlarged. To augment more data samples for improving model performance without any human effort, we propose Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find that when the augmented samples contain more distinct reasoning paths, RFT improves mathematical reasoning performance more. We also find that RFT brings more improvement for less performant LLMs. Furthermore, combining rejection samples from multiple models pushes LLaMA-7B to an accuracy of 49.3%, significantly outperforming the supervised fine-tuning (SFT) accuracy of 35.9%.
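
The RFT data-collection step described in the abstract can be summarized as: sample many candidate reasoning paths from the SFT model for each training question, keep only those whose final answer matches the reference, and deduplicate so the augmented set contains distinct reasoning paths. Below is a minimal illustrative sketch of that loop, assuming a Hugging Face causal LM checkpoint and a GSM8K-style dataset of question/numeric-answer pairs. The checkpoint path, sampling hyperparameters, answer-extraction heuristic, and text-based deduplication are placeholder assumptions for illustration, not the paper's exact setup.

```python
# Illustrative sketch of Rejection sampling Fine-Tuning (RFT) data collection:
# sample k reasoning paths per question from an SFT model, keep only those whose
# final answer matches the reference, and keep only distinct reasoning paths.
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/sft-llama-7b"  # hypothetical SFT checkpoint path
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def final_answer(text: str):
    """Heuristic: take the last number in a generated solution as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


def collect_rft_samples(dataset, k: int = 100, max_new_tokens: int = 512):
    """dataset: iterable of dicts with 'question' and 'answer' (ground-truth number)."""
    augmented = []
    for example in dataset:
        prompt = f"Question: {example['question']}\nAnswer:"
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=0.7,
            num_return_sequences=k,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
        seen_paths = set()
        for seq in outputs:
            # Decode only the newly generated tokens (drop the prompt).
            path = tokenizer.decode(
                seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True
            )
            # Rejection step: keep only paths that reach the correct final answer.
            if final_answer(path) != str(example["answer"]):
                continue
            # Distinctness: the abstract does not specify the criterion; normalized
            # text equality is used here as a simple proxy.
            key = " ".join(path.split())
            if key in seen_paths:
                continue
            seen_paths.add(key)
            augmented.append({"question": example["question"], "reasoning": path})
    return augmented
```

The collected samples would then serve as the augmented fine-tuning dataset for a further round of supervised fine-tuning, which is the "rejection sampling fine-tuning" stage the abstract refers to.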