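Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

Abstract

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). Despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limits of the base model, because it mainly optimizes over the model's existing knowledge rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, incorporating new knowledge and reasoning patterns from high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the model's current scope. Motivated by these complementary strengths, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is trained primarily with RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and training alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.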
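The interleaving described in the abstract can be pictured with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how a question is judged "challenging" or how often fine-tuning runs, so the sketch assumes that a question is hard when every sampled rollout fails, and that a fine-tuning step fires once a buffer of hard questions reaches a fixed size. All names (`Question`, `relift_style_step`, `rollout_is_correct`, `rl_update`, `sft_update`) are hypothetical placeholders for a real rollout, RL, and SFT stack.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Question:
    prompt: str
    reference_solution: str  # high-quality demonstration, assumed to be available


def relift_style_step(
    questions: List[Question],
    rollout_is_correct: Callable[[Question], bool],       # hypothetical: sample one answer and verify it
    rl_update: Callable[[Question, List[bool]], None],    # hypothetical: RL update from rollout outcomes
    sft_update: Callable[[List[Question]], None],         # hypothetical: fine-tune on demonstrations
    hard_buffer: List[Question],
    n_rollouts: int = 4,
    sft_batch_size: int = 2,
) -> bool:
    """Run one RL pass over `questions`; return True if a fine-tuning step also ran."""
    for question in questions:
        outcomes = [rollout_is_correct(question) for _ in range(n_rollouts)]
        # RL remains the primary training signal for every question.
        rl_update(question, outcomes)
        # Assumption: a question counts as "hard" when no rollout succeeds.
        if not any(outcomes):
            hard_buffer.append(question)

    # Assumption: fine-tuning is interleaved once enough hard questions accumulate.
    if len(hard_buffer) >= sft_batch_size:
        sft_update(list(hard_buffer))
        hard_buffer.clear()
        return True
    return False
```

In this sketch fine-tuning is applied only to the questions that RL made no progress on, which is one way to use a small fraction of the demonstration data. The actual selection rule and schedule in ReLIFT may differ.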

