ExpRL: Exploratory Reinforcement Learning for Large Language Model Mid-Training

Abstract

Sparse-reward reinforcement learning (RL) has become a standard tool for improving the reasoning of large language models (LLMs), but its success depends critically on the coverage already present in the base model. In practice, models are often primed for RL through mid-training on curated reasoning traces that teach useful primitive skills such as decomposition, verification, or self-correction. Although effective, this strategy requires manually specifying what the model should learn, and it remains unclear whether such primitive coverage suffices for much harder problems, which require combining these skills into broader solution strategies. We study a more automated approach: RL-based mid-training on large corpora of human-written question-answer data. Rather than treating reference solutions as targets to imitate, our method, ExpRL, uses them as reward scaffolds: references are hidden from the policy and used only to construct problem-specific grading rubrics for judging on-policy reasoning traces. The policy samples from the original problem prompt, while an LLM judge compares each sampled reasoning trace against the reference solution and assigns outcome-level or process-level dense rewards. This lets ExpRL reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors that sparse final-answer rewards often fail to upweight. On challenging math reasoning tasks, ExpRL yields stronger RL priming than SFT, sparse-reward GRPO, and self-distillation, and provides a better initialization for subsequent sparse-reward RL. Additional mixed-domain experiments suggest that ExpRL can extend beyond the original math-only setting.
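
To make the contrast between sparse and rubric-based dense rewards concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hand-written rubric (in the abstract, rubrics are constructed from hidden reference solutions), a hypothetical keyword-matching stand-in for the LLM judge, and a GRPO-style group normalization of rewards, which the abstract does not state ExpRL uses. The names `RubricItem`, `dense_reward`, `keyword_judge`, and `group_relative_advantages` are illustrative only.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List


@dataclass
class RubricItem:
    """One grading criterion (hypothetical structure, not from the paper)."""
    description: str  # what the judge looks for in a reasoning trace
    weight: float     # relative importance of this criterion


def keyword_judge(trace: str, item: RubricItem) -> bool:
    # Stand-in for an LLM judge (assumption): a criterion counts as met
    # if its description appears verbatim in the trace.
    return item.description.lower() in trace.lower()


def dense_reward(
    trace: str,
    rubric: List[RubricItem],
    judge: Callable[[str, RubricItem], bool],
) -> float:
    """Weighted fraction of rubric items satisfied, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if judge(trace, item))
    return earned / total


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # GRPO-style normalization within one group of samples (assumption):
    # subtract the group mean and divide by the group standard deviation.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy rubric; in the described method it would be derived from a reference
# solution that the policy never sees.
rubric = [
    RubricItem("substitution", 1.0),
    RubricItem("induction", 1.0),
    RubricItem("final answer 42", 2.0),
]

# Three sampled traces for the same prompt: full solution, partial progress, none.
traces = [
    "Use substitution, then induction; final answer 42.",
    "Use substitution but get stuck.",
    "No idea.",
]

rewards = [dense_reward(t, rubric, keyword_judge) for t in traces]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

A sparse final-answer reward would score the second and third traces identically at zero. Under the dense score the partial trace receives some credit, so its normalized advantage is higher than that of the trace with no progress.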