LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning

Abstract

This paper presents LLaMA-Berry, an advanced mathematical problem-solving framework for enhancing the mathematical reasoning ability of Large Language Models (LLMs). The framework combines Monte Carlo Tree Search (MCTS) with iterative Self-Refine to optimize the reasoning path and uses a pairwise reward model to evaluate different paths globally. By leveraging the self-critique and rewriting capabilities of LLMs, Self-Refine applied to MCTS (SR-MCTS) overcomes the inefficiencies and limitations of conventional step-wise and greedy search algorithms, enabling more efficient exploration of the solution space. The Pairwise Preference Reward Model (PPRM), inspired by Reinforcement Learning from Human Feedback (RLHF), then models pairwise preferences between solutions, and an Enhanced Borda Count (EBC) method synthesizes these preferences into a global ranking score used to find better answers. This approach addresses the challenges of scoring variability and non-independent distributions in mathematical reasoning tasks. The framework has been tested on general and advanced benchmarks and shows superior search efficiency and problem-solving capability compared with existing methods such as ToT and rStar, particularly on complex Olympiad-level benchmarks, including GPQA, AIME24, and AMC23.
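To make the idea of turning pairwise preferences into a global ranking concrete, the following is a minimal sketch of a plain Borda count over a preference matrix. It is not the Enhanced Borda Count described in the paper, whose details are not given in this abstract. The names `borda_scores`, `rank_solutions`, and the `prefers` matrix are hypothetical and introduced only for illustration, and the preference values are made up rather than produced by any reward model.

```python
from typing import List


def borda_scores(prefers: List[List[bool]]) -> List[int]:
    """Return one score per solution: the number of pairwise comparisons it wins.

    prefers[i][j] is True when solution i is preferred over solution j.
    In a real system these entries would come from a pairwise reward model;
    here they are supplied by hand (an assumption for illustration).
    """
    n = len(prefers)
    return [
        sum(1 for j in range(n) if j != i and prefers[i][j])
        for i in range(n)
    ]


def rank_solutions(prefers: List[List[bool]]) -> List[int]:
    """Return solution indices ordered from best to worst by Borda score.

    Ties are broken by the lower index, an arbitrary choice for this sketch.
    """
    scores = borda_scores(prefers)
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


# Hypothetical preferences among three candidate solutions:
# solution 2 beats solutions 0 and 1, and solution 0 beats solution 1.
example_prefers = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
]

if __name__ == "__main__":
    print(borda_scores(example_prefers))
    print(rank_solutions(example_prefers))
```

Counting wins gives every solution a score that depends on all of its comparisons rather than on a single absolute rating. This is one simple way to obtain the kind of global ranking the abstract refers to.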
