Improve Mathematical Reasoning in Language Models by Automated Process Supervision
June 5, 2024
作者: Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
cs.AI
Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems or
generating code, remain a significant hurdle for even the most advanced large
language models (LLMs). Verifying LLM outputs with an Outcome Reward Model
(ORM) is a standard inference-time technique aimed at enhancing the reasoning
performance of LLMs. However, this still proves insufficient for reasoning
tasks with a lengthy or multi-hop reasoning chain, where the intermediate
outcomes are neither properly rewarded nor penalized. Process supervision
addresses this limitation by assigning intermediate rewards during the
reasoning process. To date, the methods used to collect process supervision
data have relied on either human annotation or per-step Monte Carlo estimation,
both prohibitively expensive to scale, thus hindering the broad application of
this technique. In response to this challenge, we propose a novel
divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named
OmegaPRM for the efficient collection of high-quality process
supervision data. This algorithm swiftly identifies the first error in the
Chain of Thought (CoT) with binary search and balances the positive and
negative examples, thereby ensuring both efficiency and quality. As a result,
we are able to collect over 1.5 million process supervision annotations to
train a Process Reward Model (PRM). Utilizing this fully automated process
supervision alongside the weighted self-consistency algorithm, we have enhanced
the instruction-tuned Gemini Pro model's math reasoning performance, achieving
a 69.4% success rate on the MATH benchmark, a 36% relative improvement from
the 51% base model performance. Additionally, the entire process operates
without any human intervention, making our method both financially and
computationally cost-effective compared to existing methods.