GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
February 13, 2024
Authors: Alex Havrilla, Sharath Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Roberta Raileanu
cs.AI
Abstract
State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science, or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict the correctness of the final answer, offer one convenient solution for deciding when to refine. Process-Based Reward Models (PRMs), trained to predict the correctness of intermediate steps, can then be used to indicate where to refine, but they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs), which are trained only on synthetic data to approximate the expected future reward of the optimal policy, V^{\star}. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once, as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models, which additionally take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing the data used to train the SORM. We find that combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best-of-three-samples baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.