Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR
May 15, 2026
Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. Increasing the number of rollouts alleviates this issue, but such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured, diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on a lightweight, strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from such structured exploration, we further propose a unified objective that decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO given up to 8 times larger rollout budgets, and it also outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
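The abstract does not spell out the unified objective, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes each rollout is conditioned on one of a few strategy hints, and that a group-centered reward can be split into an inter-context part (how a strategy's mean reward compares with the overall mean) and an intra-context part (how a rollout compares with others under the same strategy). The hint strings and the names `nudge_prompts` and `decompose_advantages` are hypothetical, and the distillation term is omitted.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical strategy hints; the actual contexts are not given in the abstract.
STRATEGIES = [
    "Hint: try working backwards from the goal.",
    "Hint: test small cases first and look for a pattern.",
]


def nudge_prompts(question, strategies):
    """Build one strategy-conditioned prompt per hint (assumed prompt format)."""
    return [f"{hint}\n\n{question}" for hint in strategies]


def decompose_advantages(rewards, context_ids):
    """Split each centered reward into inter- and intra-context parts.

    For every rollout i with context c:
        inter[i] = mean reward of context c - mean reward of all rollouts
        intra[i] = rewards[i] - mean reward of context c
    so that inter[i] + intra[i] == rewards[i] - global mean.
    """
    global_mean = mean(rewards)
    grouped = defaultdict(list)
    for reward, ctx in zip(rewards, context_ids):
        grouped[ctx].append(reward)
    context_mean = {ctx: mean(values) for ctx, values in grouped.items()}
    inter = [context_mean[ctx] - global_mean for ctx in context_ids]
    intra = [r - context_mean[ctx] for r, ctx in zip(rewards, context_ids)]
    return inter, intra


if __name__ == "__main__":
    prompts = nudge_prompts("What is 17 * 23?", STRATEGIES)
    # Two rollouts per strategy with binary verifiable rewards (made-up values).
    rewards = [1.0, 0.0, 0.0, 0.0]
    context_ids = [0, 0, 1, 1]
    inter, intra = decompose_advantages(rewards, context_ids)
    print(len(prompts), inter, intra)
```

Under this assumed decomposition, the two parts sum to the usual group-centered reward, so weighting them equally recovers a mean-baseline advantage computed over the whole group, while weighting them differently would let a trainer credit the choice of strategy separately from execution within a strategy.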