Enhancing Reinforcement Learning with Verifiable Rewards through Randomly Selected Few-Shot Guidance

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) that use chain-of-thought rollouts for tasks such as math and coding. Nevertheless, RLVR suffers from poor sample efficiency on difficult problems where correct rollouts are hard to generate. Prior work addresses this issue with demonstration-guided RLVR, i.e., applying Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a large amount of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset, which prevent overfitting caused by multi-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, and even matches the performance those baselines reach with the full dataset.
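
The three components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the decay schedule or how the two signals are combined, so the cosine schedule, the additive loss form, and the names `select_few_shot`, `sft_weight`, and `combined_loss` are all assumptions made for illustration.

```python
import math
import random


def select_few_shot(sft_dataset, k=128, seed=0):
    """Randomly pick k demonstrations from the SFT dataset, without replacement."""
    rng = random.Random(seed)
    return rng.sample(sft_dataset, k)


def sft_weight(step, total_steps, w0=1.0):
    """Hypothetical cosine decay from w0 down to 0.

    The abstract only says the weights decay; the cosine shape is an assumption.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w0 * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """On-policy RL loss plus a supervised loss scaled by the decaying weight.

    The additive combination is assumed, not taken from the paper.
    """
    return rl_loss + sft_weight(step, total_steps) * sft_loss
```

In this sketch, the supervised term dominates early in training, when correct rollouts are rare, and fades as training proceeds, which is one plausible way to limit overfitting when the same 128 demonstrations are revisited over many epochs.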