ReFT: Reasoning with Reinforced Fine-Tuning
January 17, 2024
Authors: Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
cs.AI
Abstract
One way to enhance the reasoning capability of Large Language Models (LLMs)
is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT)
annotations. This approach does not show sufficiently strong generalization
ability, however, because the training only relies on the given CoT data. In
math problem-solving, for example, there is usually only one annotated
reasoning path for each question in the training data. Intuitively, it would be
better for the algorithm to learn from multiple annotated reasoning paths given
a question. To address this issue, we propose a simple yet effective approach
called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of
learning LLMs for reasoning, with math problem-solving as an example. ReFT
first warms up the model with SFT, and then employs online reinforcement
learning, specifically the PPO algorithm in this paper, to further fine-tune
the model, where abundant reasoning paths are automatically sampled
given the question and the rewards are naturally derived from the ground-truth
answers. Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that
ReFT significantly outperforms SFT, and performance can potentially be
boosted further by combining it with inference-time strategies such as majority
voting and re-ranking. Note that ReFT obtains the improvement by learning from the
same training questions as SFT, without relying on extra or augmented training
questions. This indicates a superior generalization ability for ReFT.
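
As a rough illustration of the ideas in the abstract, the sketch below (Python, not from the paper) shows how a terminal reward can be derived for a sampled reasoning path by comparing its extracted final answer against the ground-truth answer, together with a simple majority-voting helper of the kind mentioned as an inference-time strategy. The answer-extraction heuristic (taking the last number in the chain of thought) and all function names are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of the reward signal ReFT derives from
# ground-truth answers, plus the majority-voting inference strategy mentioned
# in the abstract. The "last number in the CoT" convention is an illustrative
# assumption only; PPO would optimize the reward returned below.
import re
from collections import Counter
from typing import Optional


def extract_answer(cot: str) -> Optional[str]:
    """Pull a final numeric answer out of a sampled chain-of-thought string."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cot)
    return numbers[-1] if numbers else None


def reward(sampled_cot: str, ground_truth: str) -> float:
    """Terminal reward for one sampled reasoning path: 1 if the extracted
    answer matches the ground-truth answer, 0 otherwise."""
    pred = extract_answer(sampled_cot)
    return 1.0 if pred is not None and pred == ground_truth else 0.0


def majority_vote(sampled_cots: list[str]) -> Optional[str]:
    """Inference-time strategy: sample many reasoning paths for one question
    and return the most frequent final answer."""
    answers = [a for a in (extract_answer(c) for c in sampled_cots) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


if __name__ == "__main__":
    cots = [
        "3 apples + 4 apples = 7 apples. The answer is 7",
        "3 + 4 = 7. The answer is 7",
        "3 * 4 = 12. The answer is 12",
    ]
    print(reward(cots[0], "7"))  # 1.0
    print(majority_vote(cots))   # "7"
```

Because the reward depends only on the ground-truth answer already present in the training data, many distinct reasoning paths can be scored for the same question without any extra annotation, which is what lets ReFT learn from more than the single annotated path used by SFT.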