Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

August 12, 2024
Authors: Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang
cs.AI

Abstract

This paper introduces rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play generation-discrimination process. First, a target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. Reasoning trajectories agreed upon by both models are considered mutually consistent and are thus more likely to be correct. Extensive experiments across five SLMs demonstrate that rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.
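The generation-discrimination loop described above can be sketched in miniature. This is a simplified illustration, not the paper's implementation: all function names are hypothetical, the MCTS rollouts and the two SLMs are replaced by stub callables, and "mutual consistency" is reduced to checking that the discriminator reproduces the generator's final answer.

```python
# Hypothetical sketch of rStar-style mutual reasoning.
# `generate` stands in for MCTS rollouts by the target SLM;
# `discriminate` stands in for the peer SLM re-completing a trajectory.

from collections import Counter


def mutual_consistent_answer(question, generate, discriminate):
    """Keep trajectories both models agree on, then majority-vote the answer."""
    trajectories = generate(question)  # list of (reasoning_steps, final_answer)
    agreed = [
        (steps, answer)
        for steps, answer in trajectories
        # The discriminator re-derives an answer from the partial reasoning;
        # agreement between the two models marks the trajectory as reliable.
        if discriminate(question, steps) == answer
    ]
    if not agreed:
        return None
    # Among mutually consistent trajectories, return the most common answer.
    counts = Counter(answer for _, answer in agreed)
    return counts.most_common(1)[0][0]


# Toy stand-ins: four candidate trajectories, a discriminator that
# always derives "72" from the reasoning steps.
fake_generate = lambda q: [("steps-a", "72"), ("steps-b", "64"),
                           ("steps-c", "72"), ("steps-d", "72")]
fake_discriminate = lambda q, steps: "72"

print(mutual_consistent_answer("toy question", fake_generate, fake_discriminate))
```

In the toy run, three of four trajectories agree with the discriminator's answer "72", so that answer is selected; the disagreeing trajectory is filtered out before voting, which is the core of the mutual-consistency idea.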
