BroRL: Scaling Reinforcement Learning via Broadened Exploration

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work, ProRL, has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration. This yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis, which allows us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N (corresponding to ample exploration) guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.
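The one-step argument in the abstract can be illustrated with a toy sketch. This is not the paper's mass balance equation or its simulation setup. It assumes a single softmax distribution over a handful of tokens, a reward of +1 for correct tokens and -1 for incorrect ones, and one REINFORCE-style update on the logits. The function names, the step size `eta`, and the example probabilities are all hypothetical. The softmax gradient couples every logit to the net reward of the batch, so tokens that were never sampled still move; with few rollouts that coupling can lower the total correct mass, and with many rollouts it tends to average out.

```python
import math
import random


def correct_mass(probs, correct):
    """Total probability assigned to the tokens in `correct`."""
    return sum(p for i, p in enumerate(probs) if i in correct)


def one_step_update(probs, correct, counts, eta=0.1):
    """Apply one REINFORCE-style step on the logits; return new probabilities.

    counts[a] is how many of the N rollouts sampled token a.
    Reward is +1 for a correct token and -1 otherwise (toy assumption).
    """
    n = sum(counts)
    rewards = [1.0 if i in correct else -1.0 for i in range(len(probs))]
    net_reward = sum(c * r for c, r in zip(counts, rewards))
    # Gradient of (1/N) * sum_i r_i * log pi(a_i) w.r.t. logit j:
    #   (counts[j] * r_j - net_reward * p_j) / N
    # The second term touches every token, including unsampled ones.
    logits = [
        math.log(p) + eta * (c * r - net_reward * p) / n
        for p, c, r in zip(probs, counts, rewards)
    ]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]


def mass_change(probs, correct, counts, eta=0.1):
    """Change in total correct mass after one update."""
    new_probs = one_step_update(probs, correct, counts, eta)
    return correct_mass(new_probs, correct) - correct_mass(probs, correct)


def fraction_improved(probs, correct, n_rollouts, trials=200, seed=0):
    """Fraction of random trials in which the correct mass increases."""
    rng = random.Random(seed)
    tokens = list(range(len(probs)))
    wins = 0
    for _ in range(trials):
        counts = [0] * len(probs)
        for a in rng.choices(tokens, weights=probs, k=n_rollouts):
            counts[a] += 1
        if mass_change(probs, correct, counts, eta=0.1) > 0:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    probs = [0.6, 0.1, 0.3]   # hypothetical token distribution
    correct = {0, 1}          # tokens 0 and 1 are treated as correct
    for n in (1, 4, 16, 64, 256):
        print(n, fraction_improved(probs, correct, n))
```

In this toy setting, a single rollout that happens to sample the low-probability correct token (index 1) lowers the total correct mass, because the update also pushes down the unsampled high-probability correct token. Rollout counts close to the true distribution raise it instead. The sketch tracks only the total correct mass, not the per-token guarantee the abstract reports for the paper's own simulations.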
