Reinforcement Pre-Training

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained with RL, in which the model receives a verifiable reward for correctly predicting the next token of a given context. RPT offers a scalable way to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing next-token reasoning, RPT significantly improves language modeling accuracy on next-token prediction. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that next-token prediction accuracy improves consistently as training compute increases. These results position RPT as an effective and promising scaling paradigm for advancing language model pre-training.
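
The verifiable reward described in the abstract can be illustrated with a minimal sketch. It assumes an exact-match rule (1.0 when the predicted token equals the corpus token, 0.0 otherwise) and whitespace-style string tokens; the abstract does not specify the reward's exact form, and the function names `make_prediction_tasks` and `verifiable_reward` are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def make_prediction_tasks(tokens: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Turn a token sequence into (context, next_token) pairs.

    Every position after the first yields one task, so plain text supplies
    its own ground-truth answers without any annotation.
    """
    return [(list(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def verifiable_reward(predicted: str, target: str) -> float:
    """Assumed exact-match reward: 1.0 if the prediction matches the corpus token."""
    return 1.0 if predicted == target else 0.0


# Hypothetical usage: each task's target comes directly from the text itself.
tasks = make_prediction_tasks(["the", "cat", "sat"])
context, target = tasks[-1]
reward = verifiable_reward("sat", target)
```

Because the target is read from the corpus rather than from a labeled answer set, the reward can be checked automatically for any text, which is the property the abstract relies on for scaling RL to general-purpose data.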