

GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements

February 13, 2024
Authors: Alex Havrilla, Sharath Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Roberta Raileanu
cs.AI

Abstract

State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science, or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer, offer one convenient solution for deciding when to refine. Process-based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine, but they are expensive to train, requiring extensive human annotation. In this paper, we propose Stepwise ORMs (SORMs), which are trained only on synthetic data to approximate the expected future reward of the optimal policy, V*. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once, as in the case of ORMs). Our experiments show that SORMs detect incorrect reasoning steps more accurately than ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models, which additionally take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing the data used to train the SORM. We find that combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best-of-three-samples baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.
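The pipeline described in the abstract can be summarized in a short sketch. The Python code below is a minimal, hypothetical illustration of two of its ideas: labeling intermediate steps for SORM training by rolling out the current policy several times from each step prefix, and combining a global and a local refinement of a draft solution with ORM reranking. All function names and parameters here (`sample_completion`, `is_final_answer_correct`, `sorm_score`, `global_refine`, `local_refine`, `orm_score`, `num_rollouts`, `error_threshold`) are placeholder assumptions, not the paper's actual implementation.

```python
from typing import Callable, List


def label_steps_with_rollouts(
    question: str,
    steps: List[str],
    sample_completion: Callable[[str], str],         # one continuation from the current policy
    is_final_answer_correct: Callable[[str], bool],  # checks the completed solution
    num_rollouts: int = 8,
) -> List[int]:
    """Approximate V* at each step prefix: label a step 1 ("still recoverable")
    if any of num_rollouts continuations from that prefix reaches a correct
    final answer, else 0. These binary labels serve as SORM training targets."""
    labels: List[int] = []
    for i in range(1, len(steps) + 1):
        prefix = question + "\n" + "\n".join(steps[:i])
        reachable = any(
            is_final_answer_correct(sample_completion(prefix))
            for _ in range(num_rollouts)
        )
        labels.append(int(reachable))
    return labels


def refine_and_rerank(
    question: str,
    draft_steps: List[str],
    sorm_score: Callable[[str, List[str]], List[float]],       # per-step correctness scores
    global_refine: Callable[[str, List[str]], List[str]],      # rewrites the whole draft
    local_refine: Callable[[str, List[str], int], List[str]],  # fixes from the flagged step on
    orm_score: Callable[[str, List[str]], float],              # final-answer correctness score
    error_threshold: float = 0.5,
) -> List[str]:
    """Generate a global refinement and, if the SORM flags a step as likely
    wrong, a local refinement targeted at the first such step; then return
    whichever of the draft and its refinements the ORM scores highest."""
    candidates = [draft_steps, global_refine(question, draft_steps)]

    step_scores = sorm_score(question, draft_steps)
    first_error = next(
        (i for i, s in enumerate(step_scores) if s < error_threshold), None
    )
    if first_error is not None:
        candidates.append(local_refine(question, draft_steps, first_error))

    return max(candidates, key=lambda solution: orm_score(question, solution))
```

Note that gating the local refinement behind a score threshold, as above, is a simplification for illustration; the abstract only states that global and local refinements are combined with the ORM acting as the reranker over the candidate solutions.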