Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. Increasing the number of rollouts alleviates this issue, but such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured, diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on a lightweight, strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from such structured exploration, we further propose a unified objective that decomposes the reward signal into inter-context and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO even when GRPO is given rollout budgets up to 8 times larger, and it outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
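
The abstract does not spell out how the reward signal is split into inter-context and intra-context components. As a rough illustration only, the sketch below shows one natural reading, assuming a mean-centered split: each rollout's centered reward is written as (context mean minus overall mean) plus (reward minus context mean). The function name `decompose_rewards`, the strategy strings, and the reward values are all hypothetical and are not taken from the NudgeRL code; the actual objective may weight or normalize these terms differently.

```python
from statistics import fmean


def decompose_rewards(rewards_by_context):
    """Split centered rewards into inter- and intra-context parts.

    Assumption (not from the paper): a plain mean-centered split.
    rewards_by_context maps a strategy context to the non-empty list
    of verifiable rewards obtained by rollouts under that context.
    """
    all_rewards = [r for group in rewards_by_context.values() for r in group]
    grand_mean = fmean(all_rewards)  # baseline over every rollout of the prompt

    parts = {}
    for context, group in rewards_by_context.items():
        context_mean = fmean(group)
        inter = context_mean - grand_mean  # how well this strategy does overall
        # Pair the shared inter term with each rollout's intra term,
        # i.e. how the rollout compares with others under the same strategy.
        parts[context] = [(inter, r - context_mean) for r in group]
    return grand_mean, parts


# Hypothetical strategy contexts and binary rewards for a single prompt.
rewards = {
    "work backwards": [1.0, 0.0],
    "try small cases": [1.0, 1.0],
    "no hint": [0.0, 0.0],
}
grand_mean, parts = decompose_rewards(rewards)

# The two parts always sum back to the centered reward of each rollout.
for context, group in rewards.items():
    for r, (inter, intra) in zip(group, parts[context]):
        assert abs((inter + intra) - (r - grand_mean)) < 1e-12
```

Under this reading, the inter-context term credits a strategy that succeeds more often than the others, while the intra-context term separates good and bad rollouts that share the same strategy.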