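LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Abstract

This paper presents LLaMA-Berry, an advanced mathematical problem-solving framework for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and uses a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) explores the solution space more efficiently, overcoming the inefficiencies and limitations of conventional step-wise and greedy search algorithms. The Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), models pairwise preferences between solutions and uses an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks and shows superior search efficiency and problem-solving capability compared to existing methods such as ToT and rStar, particularly on complex Olympiad-level benchmarks including GPQA, AIME24, and AMC23.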
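As a rough illustration of how pairwise preferences can be turned into a global ranking, the sketch below applies a plain Borda-style count to a preference matrix. It is not the paper's Enhanced Borda Count, whose details the abstract does not give. The function names, the boolean matrix layout, and the example values are all assumptions made for illustration, and a real pairwise reward model would supply the preferences.

```python
from typing import List


def borda_scores(prefers: List[List[bool]]) -> List[int]:
    """For each solution i, count how many other solutions it beats pairwise."""
    n = len(prefers)
    return [sum(1 for j in range(n) if j != i and prefers[i][j]) for i in range(n)]


def rank_solutions(prefers: List[List[bool]]) -> List[int]:
    """Return solution indices ordered from best to worst by Borda score."""
    scores = borda_scores(prefers)
    # sorted() is stable, so tied solutions keep their original index order.
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical preferences among three candidate solutions:
# prefers[i][j] is True when an assumed pairwise reward model prefers i over j.
example = [
    [False, True, True],    # solution 0 beats solutions 1 and 2
    [False, False, False],  # solution 1 beats no other solution
    [False, True, False],   # solution 2 beats solution 1
]

print(borda_scores(example))
print(rank_solutions(example))
```

Counting pairwise wins gives every solution a score that depends on the whole candidate set rather than on a single absolute judgment, which is the general idea behind aggregating local preferences into a global ranking.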

