
Improving Large Language Model Fine-tuning for Solving Math Problems

October 16, 2023
Authors: Yixin Liu, Avi Singh, C. Daniel Freeman, John D. Co-Reyes, Peter J. Liu
cs.AI

Abstract

Despite their success in many natural language tasks, solving math problems remains a significant challenge for large language models (LLMs). A large gap exists between LLMs' pass-at-one and pass-at-N performance in solving math problems, suggesting LLMs might be close to finding correct solutions, motivating our exploration of fine-tuning methods to unlock LLMs' performance. Using the challenging MATH dataset, we investigate three fine-tuning strategies: (1) solution fine-tuning, where we fine-tune to generate a detailed solution for a given math problem; (2) solution-cluster re-ranking, where the LLM is fine-tuned as a solution verifier/evaluator to choose among generated candidate solution clusters; (3) multi-task sequential fine-tuning, which integrates both the solution generation and evaluation tasks efficiently to enhance the LLM's performance. With these methods, we present a thorough empirical study on a series of PaLM 2 models and find: (1) The quality and style of the step-by-step solutions used for fine-tuning have a significant impact on model performance; (2) While solution re-ranking and majority voting are both effective for improving model performance when used separately, they can also be used together for an even greater performance boost; (3) Multi-task fine-tuning that sequentially separates the solution generation and evaluation tasks can offer improved performance compared with the solution fine-tuning baseline. Guided by these insights, we design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models, an 11.2% accuracy improvement over the few-shot performance of the pre-trained PaLM 2-L model with majority voting.
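The combination of solution-cluster re-ranking with majority voting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each sampled solution has already been reduced to a (final_answer, verifier_score) pair, where the score would come from a fine-tuned verifier model; the function name and interface are hypothetical.

```python
from collections import defaultdict

def rank_solution_clusters(candidates):
    """Pick a final answer by combining majority voting with re-ranking.

    `candidates` is a list of (final_answer, verifier_score) pairs.
    Solutions are clustered by their final answer, and each cluster is
    weighted by the sum of its verifier scores, so both cluster size
    (majority voting) and per-solution quality (re-ranking) contribute.
    Hypothetical interface; the verifier scores are assumed inputs.
    """
    clusters = defaultdict(list)
    for answer, score in candidates:
        clusters[answer].append(score)
    # The winning answer maximizes total verifier score over its cluster.
    return max(clusters, key=lambda a: sum(clusters[a]))

# Example: 5 sampled solutions to one problem. Answer "42" appears three
# times with moderate scores, "41" once with a high score; the summed
# cluster weight favors "42" (0.9 + 0.8 + 0.7 = 2.4 > 0.95).
samples = [("42", 0.9), ("42", 0.8), ("41", 0.95), ("42", 0.7), ("7", 0.2)]
print(rank_solution_clusters(samples))  # prints "42"
```

Summing scores within a cluster is one simple way to let voting and verification reinforce each other; other aggregations (e.g., taking the maximum score per cluster) trade off the two signals differently.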