LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

October 3, 2024
Authors: Di Zhang, Jianbo Wu, Jingdi Lei, Tong Che, Jiatong Li, Tong Xie, Xiaoshui Huang, Shufei Zhang, Marco Pavone, Yuqiang Li, Wanli Ouyang, Dongzhan Zhou
cs.AI

Abstract

This paper presents an advanced mathematical problem-solving framework, LLaMA-Berry, for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and utilizes a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms by fostering a more efficient exploration of solution spaces. A Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), is then used to model pairwise preferences between solutions, utilizing an Enhanced Borda Count (EBC) method to synthesize these preferences into a global ranking score to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks, showing superior performance in terms of search efficiency and problem-solving capability compared to existing methods like ToT and rStar, particularly in complex Olympiad-level benchmarks, including GPQA, AIME24, and AMC23.
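To make the ranking step concrete, below is a minimal Python sketch of how pairwise preferences could be aggregated into a global ranking with a Borda-style count. The `pref` matrix, the 0.5 decision threshold, and the transitive-closure step are illustrative assumptions about the PPRM/EBC interface, not the paper's exact implementation.

```python
import numpy as np

def ebc_rank(pref: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Rank candidate solutions from a pairwise-preference matrix.

    pref[i, j] is assumed to be the preference model's probability
    that solution i beats solution j. The preferences are binarized
    into a directed "beats" graph, closed transitively (Warshall
    style) so that i > j and j > k implies i > k, and each solution
    is scored by its Borda count (number of solutions it beats).
    """
    n = pref.shape[0]
    beats = pref > threshold            # directed preference graph
    np.fill_diagonal(beats, False)      # no self-preference

    # Transitive closure: propagate indirect wins through each k.
    for k in range(n):
        beats |= beats[:, [k]] & beats[[k], :]

    borda = beats.sum(axis=1)           # wins per solution
    return np.argsort(-borda)           # indices, best solution first

# Hypothetical example: three candidate solutions with noisy preferences.
pref = np.array([[0.5, 0.8, 0.6],
                 [0.2, 0.5, 0.7],
                 [0.4, 0.3, 0.5]])
print(ebc_rank(pref))  # -> [0 1 2]: solution 0 is globally preferred
```

In this sketch, the transitive closure is what distinguishes the aggregation from a plain Borda count: it lets a solution collect credit for indirect wins, which helps when the preference model has not compared every pair directly.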
