Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Abstract

Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique for improving the reasoning performance of LLMs. However, this approach still proves insufficient for tasks with a lengthy or multi-hop reasoning chain, where intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both of which are prohibitively expensive to scale and have therefore hindered the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm uses binary search to swiftly identify the first error in the Chain of Thought (CoT) and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Using this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement over the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
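To illustrate the binary-search idea mentioned in the abstract, here is a minimal sketch. It is not the authors' implementation. It assumes a hypothetical oracle `prefix_is_ok` that judges whether a prefix of the reasoning chain is still error-free (the abstract does not specify how this judgment is made), and it assumes that once a prefix contains an error, every longer prefix is also judged not OK. The names `first_error_index` and `prefix_is_ok` are illustrative only.

```python
from typing import Callable, Optional, Sequence


def first_error_index(
    steps: Sequence[str],
    prefix_is_ok: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the 0-based index of the first erroneous step, or None.

    Assumptions (not taken from the source): `prefix_is_ok` is a
    hypothetical oracle, the empty prefix counts as OK, and the oracle
    is monotone (a prefix containing an error is never judged OK).
    """
    # An empty chain, or a chain whose full prefix is OK, has no error.
    if not steps or prefix_is_ok(steps):
        return None

    # Invariant: steps[:lo] is OK and steps[:hi] is not OK.
    lo, hi = 0, len(steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_is_ok(steps[:mid]):
            lo = mid  # the first error lies after position mid - 1
        else:
            hi = mid  # the first error lies at or before position mid - 1
    # steps[:hi - 1] is OK but steps[:hi] is not, so step hi - 1 is the culprit.
    return hi - 1
```

Under these assumptions, the search needs on the order of log2(n) oracle calls for a chain of n steps, rather than one call per step, which is the kind of saving the abstract attributes to the divide-and-conquer design.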
