RLP: Reinforcement as a Pretraining Objective

Abstract

The dominant paradigm for training large reasoning models starts with pretraining using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful for scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. Is this dominant paradigm an optimal way to train? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings exploration, the core spirit of reinforcement learning, to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed from the information gain it provides for predicting future tokens. This training objective encourages the model to think for itself before predicting what comes next, thus teaching independent thinking behavior earlier, during pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This approach yields a verifier-free, dense reward signal, allowing efficient training on the full document stream during pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
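The reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `information_gain_reward` and the toy probabilities are hypothetical, and the sketch assumes the two next-token probabilities (with and without the sampled reasoning chain) have already been obtained from some model. It only shows the arithmetic of the log-likelihood difference.

```python
import math


def information_gain_reward(p_with_thought: float, p_context_only: float) -> float:
    """Log-likelihood gain for the observed next token.

    p_with_thought: probability of the next token given context + sampled reasoning chain.
    p_context_only: probability of the same token given the context alone.
    Both names are hypothetical and used only for illustration.
    """
    for p in (p_with_thought, p_context_only):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_context_only)


# Toy numbers, made up for illustration only.
helpful = information_gain_reward(0.50, 0.25)    # thought raises the probability
unhelpful = information_gain_reward(0.10, 0.25)  # thought lowers the probability
neutral = information_gain_reward(0.25, 0.25)    # thought changes nothing
print(helpful, unhelpful, neutral)
```

Under these assumptions, a reasoning chain that makes the observed token more likely receives a positive reward, one that makes it less likely receives a negative reward, and one that changes nothing receives zero. Because the quantity is defined for every token position, no external verifier is needed to compute it.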
