
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

August 12, 2024
Authors: Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, Mao Yang
cs.AI

Abstract

This paper introduces rStar, a self-play mutual reasoning approach that significantly improves the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher-quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutually consistent and are thus more likely to be correct. Extensive experiments across five SLMs demonstrate that rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, and from 74.53% to 91.13% for LLaMA3-8B-Instruct. Code will be available at https://github.com/zhentingqi/rStar.
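To make the generation-discrimination idea concrete, below is a minimal Python sketch of the mutual-consistency selection step the abstract describes: a generator proposes candidate reasoning trajectories, a peer model completes a partial prefix of each, and only trajectories where both agree on the final answer are kept. The function names, the stubbed model calls, and the prefix-masking and selection heuristics here are illustrative assumptions, not the authors' implementation (which uses MCTS rollouts and a final trajectory-scoring step).

```python
# Illustrative sketch of rStar-style mutual consistency (not the official code).
# The two SLMs are stubbed with toy functions so the example runs standalone.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    steps: List[str]   # intermediate reasoning steps
    answer: str        # final answer proposed by the target (generator) SLM


def generate_candidates(question: str) -> List[Trajectory]:
    """Stand-in for the target SLM's MCTS rollouts over human-like actions."""
    return [
        Trajectory(steps=["decompose the question", "compute 6 * 7"], answer="42"),
        Trajectory(steps=["decompose the question", "misread as 6 + 7"], answer="13"),
        Trajectory(steps=["answer directly"], answer="42"),
    ]


def discriminator_complete(question: str, partial_steps: List[str]) -> str:
    """Stand-in for the peer SLM: given a partial trajectory as a hint,
    it finishes the reasoning and returns its own final answer."""
    return "42"  # toy behaviour: the peer model always reaches 42 here


def select_mutually_consistent(question: str) -> Optional[Trajectory]:
    candidates = generate_candidates(question)
    agreed: List[Trajectory] = []
    for traj in candidates:
        # Mask the tail of the trajectory and ask the peer SLM to complete it.
        prefix = traj.steps[: max(1, len(traj.steps) // 2)]
        peer_answer = discriminator_complete(question, prefix)
        # A trajectory is "mutually consistent" if both models reach the same answer.
        if peer_answer == traj.answer:
            agreed.append(traj)
    # Placeholder selection: take the first agreed trajectory.
    return agreed[0] if agreed else None


if __name__ == "__main__":
    best = select_mutually_consistent("What is 6 * 7?")
    print(best)
```

In this toy run, the two trajectories ending in "42" survive the mutual-consistency check while the "13" trajectory is discarded, mirroring how agreement between the generator and discriminator filters out likely-incorrect reasoning paths.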
