LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
October 3, 2024
Authors: Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou
cs.AI
Abstract
This paper presents an advanced mathematical problem-solving framework,
LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language
Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with
iterative Self-Refine to optimize the reasoning path and utilizes a pairwise
reward model to evaluate different paths globally. By leveraging the
self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS
(SR-MCTS) overcomes the inefficiencies and limitations of conventional
step-wise and greedy search algorithms by fostering a more efficient
exploration of solution spaces. Pairwise Preference Reward Model (PPRM),
inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to
model pairwise preferences between solutions, utilizing an Enhanced Borda Count
(EBC) method to synthesize these preferences into a global ranking score to
find better answers. This approach addresses the challenges of scoring
variability and non-independent distributions in mathematical reasoning tasks.
The framework has been tested on general and advanced benchmarks, showing
superior performance in terms of search efficiency and problem-solving
capability compared to existing methods like ToT and rStar, particularly in
complex Olympiad-level benchmarks, including GPQA, AIME24, and AMC23.
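The aggregation step behind the Enhanced Borda Count can be illustrated with the plain Borda count it builds on: each candidate solution is scored by the number of pairwise comparisons it wins under the preference model, and candidates are ranked by that score. The sketch below is a minimal illustration, not the paper's method; `prefers` is a hypothetical stand-in for the PPRM, and the "enhanced" refinements of EBC are not shown.

```python
# Minimal sketch of Borda-count aggregation of pairwise preferences.
# `prefers(a, b)` stands in for the PPRM: it returns the probability
# that candidate a is a better solution than candidate b.

def borda_scores(solutions, prefers):
    """Score each candidate by how many pairwise comparisons it wins."""
    scores = {s: 0 for s in solutions}
    for a in solutions:
        for b in solutions:
            if a != b and prefers(a, b) > 0.5:
                scores[a] += 1  # a wins this pairwise comparison
    return scores

# Toy stand-in preference model: prefer the candidate with the higher
# mock quality value (these names and numbers are illustrative only).
quality = {"sol_A": 0.9, "sol_B": 0.4, "sol_C": 0.7}
prefers = lambda a, b: 1.0 if quality[a] > quality[b] else 0.0

scores = borda_scores(list(quality), prefers)
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking == ["sol_A", "sol_C", "sol_B"]
```

Turning pairwise preferences into a single global score this way is what lets the framework compare full reasoning paths that were never evaluated against each other directly.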