Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has been highly successful in training Large Language Models (LLMs) that produce chain-of-thought rollouts for tasks such as math and coding. However, RLVR is sample-inefficient on difficult problems where correct rollouts are hard to generate. Prior work addresses this with demonstration-guided RLVR, i.e., applying Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires large amounts of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multi-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, and even matches the performance those baselines reach with the full dataset.
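The role of the decaying weight can be illustrated with a minimal sketch. The abstract does not specify the decay schedule or how the signals are combined, so everything below is an assumption made for illustration: the exponential schedule, the additive combination of losses, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import random


def sample_few_shot(sft_dataset, k=128, seed=0):
    """Randomly pick k demonstrations from an SFT dataset.

    The value k=128 comes from the abstract; the seeding is an assumption.
    """
    rng = random.Random(seed)
    return rng.sample(sft_dataset, k)


def decayed_sft_weight(step, initial_weight=1.0, decay_rate=0.99):
    """Hypothetical exponential decay of the SFT weight over training steps.

    The abstract only says the weights decay; the schedule is assumed here.
    """
    return initial_weight * decay_rate ** step


def combined_loss(rl_loss, sft_loss, step, initial_weight=1.0, decay_rate=0.99):
    """Assumed additive combination of an RL loss and a down-weighted SFT loss.

    As the step count grows, the SFT term contributes less, which is one way
    to limit overfitting when the same few demonstrations are reused.
    """
    weight = decayed_sft_weight(step, initial_weight, decay_rate)
    return rl_loss + weight * sft_loss


if __name__ == "__main__":
    demos = sample_few_shot(list(range(10_000)))
    print(len(demos))
    for step in (0, 100, 500):
        print(step, decayed_sft_weight(step))
```

Under these assumptions, the SFT term dominates early in training, when correct rollouts are rare, and fades as the few-shot demonstrations are revisited over many epochs.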