Improve Mathematical Reasoning in Language Models by Automated Process Supervision
June 5, 2024
作者: Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
cs.AI
Abstract
Complex multi-step reasoning tasks, such as solving mathematical problems or
generating code, remain a significant hurdle for even the most advanced large
language models (LLMs). Verifying LLM outputs with an Outcome Reward Model
(ORM) is a standard inference-time technique aimed at enhancing the reasoning
performance of LLMs. However, this still proves insufficient for reasoning
tasks with a lengthy or multi-hop reasoning chain, where the intermediate
outcomes are neither properly rewarded nor penalized. Process supervision
addresses this limitation by assigning intermediate rewards during the
reasoning process. To date, the methods used to collect process supervision
data have relied on either human annotation or per-step Monte Carlo estimation,
both prohibitively expensive to scale, thus hindering the broad application of
this technique. In response to this challenge, we propose a novel
divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named
OmegaPRM for the efficient collection of high-quality process
supervision data. This algorithm swiftly identifies the first error in the
Chain of Thought (CoT) with binary search and balances the positive and
negative examples, thereby ensuring both efficiency and quality. As a result,
we are able to collect over 1.5 million process supervision annotations to
train a Process Reward Model (PRM). Utilizing this fully automated process
supervision alongside the weighted self-consistency algorithm, we have enhanced
the instruction-tuned Gemini Pro model's math reasoning performance, achieving
a 69.4% success rate on the MATH benchmark, a 36% relative improvement from
the 51% base model performance. Additionally, the entire process operates
without any human intervention, making our method both financially and
computationally cost-effective compared to existing methods.
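
The abstract describes the core of OmegaPRM as a divide-and-conquer procedure that locates the first error in a chain of thought via binary search, scoring each solution prefix with Monte Carlo rollouts. The sketch below illustrates only that binary-search component under simplified assumptions: `sample_completion`, the rollout count, and the zero-value threshold are hypothetical names and choices introduced here for clarity, not the paper's implementation (which embeds this step inside a full MCTS over the solution tree).

```python
from typing import Callable, List

def mc_value(question: str,
             prefix: List[str],
             sample_completion: Callable[[str, List[str]], bool],
             num_rollouts: int = 8) -> float:
    """Monte Carlo estimate of a prefix's value: the fraction of rollouts
    that continue the solution from `prefix` and reach a correct final answer.
    `sample_completion` is a hypothetical helper that asks the policy LLM to
    finish the solution and returns True if the final answer is correct."""
    wins = sum(sample_completion(question, prefix) for _ in range(num_rollouts))
    return wins / num_rollouts

def first_error_step(question: str,
                     steps: List[str],
                     sample_completion: Callable[[str, List[str]], bool],
                     threshold: float = 0.0) -> int:
    """Binary-search the index of the first incorrect step in a CoT that ends
    in a wrong answer, assuming prefix values are (approximately) monotone:
    once an error is introduced, later prefixes stay at or below `threshold`.
    Steps before the returned index can be labeled positive, the rest negative."""
    lo, hi = 0, len(steps) - 1          # candidate indices for the first error
    while lo < hi:
        mid = (lo + hi) // 2
        if mc_value(question, steps[:mid + 1], sample_completion) > threshold:
            lo = mid + 1                # prefix through `mid` is still recoverable
        else:
            hi = mid                    # the error occurs at or before `mid`
    return lo
```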
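At inference time, the trained PRM is combined with weighted self-consistency: sample multiple solutions, weight each candidate final answer by a PRM-derived score for its reasoning, and return the answer with the largest total weight. A minimal sketch of that aggregation follows; how per-step PRM probabilities are reduced to a single `prm_score` (e.g. minimum or product over steps) is an assumption made here, not something specified in the abstract.

```python
from collections import defaultdict
from typing import Callable, List, Tuple

def weighted_self_consistency(samples: List[Tuple[str, List[str]]],
                              prm_score: Callable[[List[str]], float]) -> str:
    """Aggregate sampled solutions by their final answer, weighting each vote
    by a PRM-derived score for its reasoning steps, and return the answer with
    the largest total weight. `samples` pairs a final answer with its list of
    CoT steps; `prm_score` is a hypothetical scorer over those steps."""
    votes = defaultdict(float)
    for answer, steps in samples:
        votes[answer] += prm_score(steps)
    return max(votes, key=votes.get)
```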