Enhancing Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) that use chain-of-thought rollouts for tasks such as math and coding. Nevertheless, RLVR suffers from poor sample efficiency on difficult problems, where correct rollouts are hard to generate. Prior work addresses this issue with demonstration-guided RLVR, i.e., applying Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a large amount of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multi-epoch training. On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.
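The three components named in the abstract can be illustrated with a minimal sketch. It is not the FEST implementation: the abstract does not specify the decay schedule or how the losses are combined, so the cosine schedule, the additive combination, and all names here (sample_few_shot, sft_weight, combined_loss, w0, w_min) are assumptions made for illustration only.

```python
import math
import random


def sample_few_shot(sft_dataset, k=128, seed=0):
    """Randomly pick k demonstrations from an SFT dataset (k=128 as in the abstract)."""
    rng = random.Random(seed)
    return rng.sample(sft_dataset, k)


def sft_weight(step, total_steps, w0=1.0, w_min=0.0):
    """Decaying weight on the few-shot SFT loss.

    Assumption: a cosine decay from w0 to w_min. The abstract only says
    the weight decays; the actual schedule may differ.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w0 - w_min) * (1.0 + math.cos(math.pi * progress))


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """Hypothetical objective: on-policy RL loss plus a decaying supervised term."""
    return rl_loss + sft_weight(step, total_steps) * sft_loss


if __name__ == "__main__":
    demos = sample_few_shot(list(range(10_000)))
    print(len(demos))
    for step in (0, 250, 500, 750, 1000):
        print(step, round(sft_weight(step, 1000), 3))
```

Under this sketch, the supervised term dominates early, when correct rollouts are rare, and fades as training repeats the same 128 demonstrations over many epochs, which is the overfitting risk the decaying weight is meant to limit.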