LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

October 3, 2024
Authors: Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou
cs.AI

Abstract

This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. A Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24, and AMC23.
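To make the ranking step concrete, below is a minimal Python sketch of how pairwise preferences could be aggregated into a global ranking with a Borda-style count. The `pref` matrix, the 0.5 decision threshold, and the transitive-closure step are illustrative assumptions about the PPRM/EBC interface, not the paper's exact implementation.

```python
import numpy as np

def ebc_rank(pref: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Rank candidate solutions from a pairwise-preference matrix.

    pref[i, j] is assumed to be the preference model's probability
    that solution i beats solution j. The preferences are binarized
    into a directed "beats" graph, closed transitively (Warshall
    style) so that i > j and j > k implies i > k, and each solution
    is scored by its Borda count (number of solutions it beats).
    """
    n = pref.shape[0]
    beats = pref > threshold            # directed preference graph
    np.fill_diagonal(beats, False)      # no self-preference

    # Transitive closure: propagate indirect wins through each k.
    for k in range(n):
        beats |= beats[:, [k]] & beats[[k], :]

    borda = beats.sum(axis=1)           # wins per solution
    return np.argsort(-borda)           # indices, best solution first

# Hypothetical example: three candidate solutions with noisy preferences.
pref = np.array([[0.5, 0.8, 0.6],
                 [0.2, 0.5, 0.7],
                 [0.4, 0.3, 0.5]])
print(ebc_rank(pref))  # -> [0 1 2]: solution 0 is globally preferred
```

In this sketch, the transitive closure is what distinguishes the aggregation from a plain Borda count: it lets a solution collect credit for indirect wins, which helps when the preference model has not compared every pair directly.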
