ReFT: Reasoning with Reinforced Fine-Tuning
January 17, 2024
Authors: Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
cs.AI
Abstract
One way to enhance the reasoning capability of Large Language Models (LLMs)
is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT)
annotations. This approach does not show sufficiently strong generalization
ability, however, because the training only relies on the given CoT data. In
math problem-solving, for example, there is usually only one annotated
reasoning path for each question in the training data. Intuitively, it would be
better for the algorithm to learn from multiple annotated reasoning paths given
a question. To address this issue, we propose a simple yet effective approach
called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of
learning LLMs for reasoning, with math problem-solving as an example. ReFT
first warms up the model with SFT, and then employs online reinforcement
learning, specifically the PPO algorithm in this paper, to further fine-tune
the model, where abundant reasoning paths are automatically sampled
given the question and the rewards are naturally derived from the ground-truth
answers. Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that
ReFT significantly outperforms SFT, and performance can potentially be
boosted further by combining it with inference-time strategies such as majority
voting and re-ranking. Note that ReFT obtains the improvement by learning from the
same training questions as SFT, without relying on extra or augmented training
questions. This indicates a superior generalization ability for ReFT.
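
As a rough illustration of the ideas in the abstract, the sketch below (Python, not from the paper) shows how a terminal reward can be derived for a sampled reasoning path by comparing its extracted final answer against the ground-truth answer, together with a simple majority-voting helper of the kind mentioned as an inference-time strategy. The answer-extraction heuristic (taking the last number in the chain of thought) and all function names are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the reward signal ReFT derives from
# ground-truth answers, plus the majority-voting inference strategy mentioned
# in the abstract. The "last number in the CoT" convention is an illustrative
# assumption only; PPO would optimize the reward returned below.
import re
from collections import Counter
from typing import Optional


def extract_answer(cot: str) -> Optional[str]:
    """Pull a final numeric answer out of a sampled chain-of-thought string."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cot)
    return numbers[-1] if numbers else None


def reward(sampled_cot: str, ground_truth: str) -> float:
    """Terminal reward for one sampled reasoning path: 1 if the extracted
    answer matches the ground-truth answer, 0 otherwise."""
    pred = extract_answer(sampled_cot)
    return 1.0 if pred is not None and pred == ground_truth else 0.0


def majority_vote(sampled_cots: list[str]) -> Optional[str]:
    """Inference-time strategy: sample many reasoning paths for one question
    and return the most frequent final answer."""
    answers = [a for a in (extract_answer(c) for c in sampled_cots) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


if __name__ == "__main__":
    cots = [
        "3 apples + 4 apples = 7 apples. The answer is 7",
        "3 + 4 = 7. The answer is 7",
        "3 * 4 = 12. The answer is 12",
    ]
    print(reward(cots[0], "7"))  # 1.0
    print(majority_vote(cots))   # "7"
```

Because the reward depends only on the ground-truth answer already present in the training data, many distinct reasoning paths can be scored for the same question without any extra annotation, which is what lets ReFT learn from more than the single annotated path used by SFT.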