Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. Increasing the number of rollouts alleviates this issue, but such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured, diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on a lightweight, strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from such structured exploration, we further propose a unified objective that decomposes the reward signal into inter-context and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO given up to 8 times larger rollout budgets, and it also outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
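To make the idea of splitting the reward signal into inter-context and intra-context components concrete, the following is a minimal sketch of one possible reading, not the objective used in NudgeRL. It assumes a simple mean-centered decomposition: the inter-context part is the gap between a context's mean reward and the mean over all rollouts, and the intra-context part is the gap between a rollout's reward and its own context's mean. The function name `decompose_advantages`, the weights `w_inter` and `w_intra`, and the strategy strings are hypothetical and chosen only for illustration; the standard-deviation normalization commonly used in GRPO and the distillation term are omitted.

```python
from statistics import fmean


def decompose_advantages(rewards_by_context, w_inter=1.0, w_intra=1.0):
    """Split mean-centered rewards into inter- and intra-context parts.

    rewards_by_context maps a strategy context (hypothetical strings here)
    to the verifiable rewards of the rollouts sampled under that context.
    This is an assumed decomposition, not the paper's exact objective.
    """
    all_rewards = [r for rs in rewards_by_context.values() for r in rs]
    global_mean = fmean(all_rewards)

    advantages = {}
    for context, rewards in rewards_by_context.items():
        context_mean = fmean(rewards)
        # Inter-context: how well this strategy does relative to all rollouts.
        inter = context_mean - global_mean
        # Intra-context: how well each rollout does relative to its own strategy.
        advantages[context] = [
            w_inter * inter + w_intra * (r - context_mean) for r in rewards
        ]
    return advantages


# Two hypothetical strategy contexts with two rollouts each (1.0 = correct).
rewards = {
    "work backwards": [1.0, 0.0],
    "try small cases": [0.0, 0.0],
}
advantages = decompose_advantages(rewards)
```

With both weights set to 1, the two parts sum to the reward minus the overall mean, so the sketch reduces to a plain group-centered advantage. Changing the weights shifts credit between whole strategies and individual rollouts within a strategy.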