Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. Increasing the number of rollouts alleviates this issue, but such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured, diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on a lightweight, strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from this structured exploration, we further propose a unified objective that decomposes the reward signal into inter-context and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO even when GRPO is given rollout budgets up to 8 times larger, and it outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
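
The abstract does not spell out the exact form of Strategy Nudging or of the inter-/intra-context decomposition, so the following is only an illustrative sketch under stated assumptions, not the NudgeRL implementation. It assumes that a strategy context is a short hint prepended to the question, and that the centered reward of each rollout is split additively into a between-context part (context mean minus global mean) and a within-context part (reward minus context mean). The strategy hints, the function names `build_nudged_prompts` and `decompose_rewards`, and the reward values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical strategy hints; the source does not list the actual contexts.
STRATEGIES = [
    "Try working backwards from the goal.",
    "Try a small concrete case first.",
    "Try an algebraic substitution.",
]


def build_nudged_prompts(question, strategies, rollouts_per_strategy):
    """Return one (context_id, prompt) pair per rollout.

    Assumption: a context is a plain-text hint placed before the question.
    """
    prompts = []
    for context_id, hint in enumerate(strategies):
        for _ in range(rollouts_per_strategy):
            prompts.append((context_id, f"{hint}\n\n{question}"))
    return prompts


def decompose_rewards(context_ids, rewards):
    """Split each centered reward into inter- and intra-context parts.

    inter[i] = mean reward of rollout i's context - mean over all rollouts
    intra[i] = reward of rollout i - mean reward of its context
    Hence inter[i] + intra[i] == rewards[i] - global mean.
    This additive split is an assumed form, not taken from the source.
    """
    by_context = defaultdict(list)
    for c, r in zip(context_ids, rewards):
        by_context[c].append(r)
    context_mean = {c: mean(rs) for c, rs in by_context.items()}
    global_mean = mean(rewards)
    inter = [context_mean[c] - global_mean for c in context_ids]
    intra = [r - context_mean[c] for c, r in zip(context_ids, rewards)]
    return inter, intra


# Usage with made-up binary verifier outcomes (two rollouts per strategy).
prompts = build_nudged_prompts("What is 12 * 13?", STRATEGIES, 2)
context_ids = [c for c, _ in prompts]
rewards = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]
inter, intra = decompose_rewards(context_ids, rewards)
```

In this toy split, the inter-context term is shared by all rollouts under the same hint and indicates which strategies tend to succeed, while the intra-context term separates better from worse rollouts under a single hint. How the actual objective weights these terms, and how the distillation term is defined, is not stated in the abstract.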