Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Abstract

Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique for improving the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both of which are prohibitively expensive to scale, hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm uses binary search to quickly identify the first error in the Chain of Thought (CoT) and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Using this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement over the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
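The idea of locating the first error in a reasoning chain by binary search can be illustrated with a minimal sketch. This is not the OmegaPRM implementation: the function name `first_error_index` and the `prefix_is_sound` oracle are hypothetical, introduced only for illustration. The sketch assumes that the empty prefix is sound and that once a step is wrong, every longer prefix is also unsound. In practice, soundness would presumably be estimated from Monte Carlo rollouts rather than given by an exact oracle; here a toy oracle stands in for that estimate.

```python
from typing import Callable, Optional, Sequence


def first_error_index(
    steps: Sequence[str],
    prefix_is_sound: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the 0-based index of the first erroneous step, or None.

    `prefix_is_sound` is a hypothetical oracle: it reports whether a prefix
    of the solution can still lead to a correct final answer. Assumes the
    empty prefix is sound and that soundness is monotone (once a prefix is
    unsound, all longer prefixes are unsound too).
    """
    if not steps or prefix_is_sound(steps):
        return None  # no steps, or the full chain is sound: nothing to locate

    # Invariant: the prefix of length `lo` is sound, the one of length `hi` is not.
    lo, hi = 0, len(steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_is_sound(steps[:mid]):
            lo = mid
        else:
            hi = mid
    # The step that turns a sound prefix into an unsound one is the first error.
    return hi - 1


# Toy usage: the oracle flags any prefix containing the marker "BAD".
toy_steps = ["step a", "step b", "BAD", "step d", "step e"]
toy_oracle = lambda prefix: "BAD" not in prefix
error_at = first_error_index(toy_steps, toy_oracle)
```

Under these assumptions, the search needs on the order of log2(n) oracle queries for a chain of n steps, instead of one query per step as in per-step estimation.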
