Improving Large Language Model Fine-tuning for Solving Math Problems
October 16, 2023
Authors: Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, Peter J. Liu
cs.AI
Abstract
Despite their success in many natural language tasks, solving math problems
remains a significant challenge for large language models (LLMs). A large gap
exists between LLMs' pass-at-one and pass-at-N performance in solving math
problems, suggesting LLMs might be close to finding correct solutions,
motivating our exploration of fine-tuning methods to unlock LLMs' performance.
Using the challenging MATH dataset, we investigate three fine-tuning
strategies: (1) solution fine-tuning, where we fine-tune to generate a detailed
solution for a given math problem; (2) solution-cluster re-ranking, where the
LLM is fine-tuned as a solution verifier/evaluator to choose among generated
candidate solution clusters; (3) multi-task sequential fine-tuning, which
integrates both solution generation and evaluation tasks together efficiently
to enhance the LLM performance. With these methods, we present a thorough
empirical study on a series of PaLM 2 models and find: (1) The quality and
style of the step-by-step solutions used for fine-tuning can make a significant
impact on the model performance; (2) While solution re-ranking and majority
voting are both effective for improving the model performance when used
separately, they can also be used together for an even greater performance
boost; (3) Multi-task fine-tuning that sequentially separates the solution
generation and evaluation tasks can offer improved performance compared with
the solution fine-tuning baseline. Guided by these insights, we design a
fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset
with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the
few-shot performance of the pre-trained PaLM 2-L model with majority voting.
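The abstract notes that solution re-ranking and majority voting can be combined for a greater boost. A minimal sketch of one such combination, assuming candidate answers have already been extracted from sampled solutions and scored by a fine-tuned verifier (the function and variable names here are illustrative, not the paper's API):

```python
from collections import defaultdict

def select_answer(candidates, verifier_scores):
    """Pick a final answer by combining majority voting with
    verifier-based re-ranking.

    candidates: final answers extracted from sampled solutions.
    verifier_scores: per-solution scores from a verifier model
    (a hypothetical stand-in for the fine-tuned LLM evaluator).
    """
    # Cluster solutions by their final answer (as in majority voting),
    # then rank clusters by the total verifier score of their members,
    # so both vote count and verifier confidence influence the choice.
    cluster_score = defaultdict(float)
    for answer, score in zip(candidates, verifier_scores):
        cluster_score[answer] += score
    return max(cluster_score, key=cluster_score.get)

answers = ["4", "4", "7", "4", "7"]
scores = [0.9, 0.8, 0.95, 0.7, 0.2]
print(select_answer(answers, scores))  # "4": cluster score 2.4 vs 1.15
```

A cluster with many moderately scored members can outrank a single high-scoring outlier, which is the intuition behind re-ranking solution clusters rather than individual solutions.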