Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

May 15, 2026
Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured, diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on a lightweight, strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from such structured exploration, we further propose a unified objective that decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO even when GRPO is given up to 8 times the rollout budget, and on average it also outperforms an oracle-guided RL baseline across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
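
The abstract does not spell out how the reward signal is split into inter- and intra-context components, so the following is only an illustrative sketch of one plausible reading, not the NudgeRL objective itself. It assumes that rollouts for a single prompt are grouped by the strategy context they were conditioned on, that the inter-context term is the context's mean reward minus the mean over all rollouts, and that the intra-context term is each rollout's reward minus its own context's mean. The function name `decompose_advantages` and the strategy labels are hypothetical; the distillation term is not modeled.

```python
from statistics import mean


def decompose_advantages(rewards_by_context):
    """Split each rollout's centered reward into inter- and intra-context parts.

    rewards_by_context maps a (hypothetical) strategy-context label to the list
    of verifiable rewards, e.g. 0.0 or 1.0, of the rollouts sampled under it.
    Returns a dict mapping each label to a list of (inter, intra) tuples.
    """
    all_rewards = [r for rs in rewards_by_context.values() for r in rs]
    global_mean = mean(all_rewards)

    parts = {}
    for label, rs in rewards_by_context.items():
        context_mean = mean(rs)
        # Assumption: how much better this strategy is than the average strategy.
        inter = context_mean - global_mean
        # Assumption: how much better this rollout is than its strategy's average.
        parts[label] = [(inter, r - context_mean) for r in rs]
    return parts


# Hypothetical rewards for six rollouts of one prompt under three strategies.
rewards = {
    "work backwards": [1.0, 1.0],
    "try small cases": [0.0, 1.0],
    "use symmetry": [0.0, 0.0],
}
parts = decompose_advantages(rewards)
for label, pairs in parts.items():
    print(label, pairs)
```

Under these assumptions the two terms sum to the reward minus the mean over all rollouts, which is a group-mean baseline without any variance normalization, so the split only redistributes credit between "which strategy" and "which rollout within a strategy" rather than changing the total signal.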