ReFT: Reasoning with Reinforced Fine-Tuning

January 17, 2024
Authors: Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
cs.AI

Abstract

One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training relies only on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warms up the model with SFT, and then employs online reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers. Extensive experiments on the GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and the performance can potentially be boosted further by combining it with inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains the improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates a superior generalization ability for ReFT.
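As a rough, hypothetical sketch (not the authors' code), the two ingredients the abstract describes concretely, a reward derived from the ground-truth answer of a sampled reasoning path and majority voting over sampled paths at inference time, could look like the following in Python. The helper names (`extract_answer`, `reward`, `majority_vote`) and the toy samples are assumptions for illustration only.

```python
# Sketch of ReFT-style reward derivation and majority voting (hypothetical helpers).
import re
from collections import Counter

def extract_answer(cot: str) -> str | None:
    """Pull the final numeric answer out of a sampled chain-of-thought string."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", cot)
    return matches[-1] if matches else None

def reward(cot: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the sampled reasoning path reaches the gold answer."""
    answer = extract_answer(cot)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

def majority_vote(cots: list[str]) -> str | None:
    """Inference-time strategy: return the most common final answer across samples."""
    answers = [a for a in (extract_answer(c) for c in cots) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None

# Toy usage with made-up sampled reasoning paths for one GSM8K-style question.
samples = [
    "3 apples + 4 apples = 7 apples. The answer is 7",
    "3 * 4 = 12. The answer is 12",
    "4 + 3 = 7. The answer is 7",
]
print([reward(s, "7") for s in samples])  # [1.0, 0.0, 1.0]
print(majority_vote(samples))             # '7'
```

In the full method these rewards would feed a PPO update over paths sampled from the warmed-up SFT policy; the sketch only shows how rewards can come "for free" from the gold answers rather than from extra annotated paths.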