Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification
January 29, 2026
Authors: Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but it remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success rates and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling framework (LENS), which first purifies prompts by identifying and removing interference tokens, then transfers the successful rollouts obtained on the purified prompts to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and an over 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
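The abstract describes LENS as two stages: purify the prompt by removing interference tokens and sample rollouts on the purified version, then pair the verified successes with the original noisy prompt to supervise policy optimization. A minimal toy sketch of that data flow under stated assumptions (the `~` flag marking interference tokens, the `purify`/`transfer` helpers, and the hard-coded rollouts are all illustrative placeholders, not the authors' implementation):

```python
def purify(prompt: str) -> str:
    """Stage 1 (toy): drop interference tokens from the prompt.
    Here interference tokens are simply pre-flagged with '~';
    the paper's actual identification method is not shown."""
    return " ".join(t for t in prompt.split() if not t.startswith("~"))

def transfer(noisy_prompt: str, successes: list) -> list:
    """Stage 2 (toy): pair successful rollouts from the purified
    prompt with the ORIGINAL noisy prompt, yielding supervision
    that teaches the policy to ignore the interference."""
    return [(noisy_prompt, r) for r in successes]

# Toy example: a math question with distracting, irrelevant tokens.
noisy = "Compute 2+3. ~Note: ~Carol ~once ~failed ~this."
clean = purify(noisy)                         # "Compute 2+3."

rollouts = ["5", "6", "5"]                    # pretend samples on the purified prompt
successes = [r for r in rollouts if r == "5"] # verifiable-reward check
pairs = transfer(noisy, successes)            # (noisy prompt, correct rollout) pairs
```

In an actual RLVR loop the `pairs` would feed the policy-optimization step (e.g. a GRPO-style update), so the gradient signal comes from the easier purified sampling while the conditioning stays on the realistic noisy prompt.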