
Less Noise, More Voice: Reinforcement Learning for Reasoning via Instruction Purification

January 29, 2026
作者: Yiju Guo, Tianyi Hu, Zexu Sun, Yankai Lin
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced LLM reasoning, but it remains constrained by inefficient exploration under limited rollout budgets, leading to low sampling success and unstable training on complex tasks. We find that many exploration failures arise not from problem difficulty, but from a small number of prompt tokens that introduce interference. Building on this insight, we propose the Less Noise Sampling Framework (LENS), which first purifies prompts by identifying and removing interference tokens, and then transfers successful rollouts from the purification process to supervise policy optimization on the original noisy prompts, enabling the model to learn to ignore interference in real-world, noisy prompting settings. Experimental results show that LENS significantly outperforms GRPO, delivering higher performance and faster convergence, with a 3.88% average gain and over a 1.6× speedup. Our work highlights the critical role of pruning interference tokens in improving rollout efficiency, offering a new perspective for RLVR research.
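The abstract describes LENS as a two-stage procedure: sample rollouts from a purified prompt, then use the successful ones to supervise policy optimization on the original noisy prompt. Below is a minimal sketch of what that rollout-collection step could look like; the helper names (purify_prompt, sample_rollouts, is_correct) and the binary-reward convention are assumptions for illustration, not the authors' implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str    # prompt stored for supervision (the original, noisy one)
    response: str  # sampled reasoning trace / answer
    reward: float  # verifiable reward: 1.0 if the answer checks out, else 0.0


def lens_collect(
    noisy_prompt: str,
    purify_prompt: Callable[[str], str],               # hypothetical: strips interference tokens
    sample_rollouts: Callable[[str, int], List[str]],  # hypothetical: policy sampling
    is_correct: Callable[[str, str], bool],            # hypothetical: verifiable-reward check
    budget: int = 8,
) -> List[Rollout]:
    """Collect training rollouts for one noisy prompt (a sketch, not the paper's code).

    Stage 1: sample from the purified prompt, where exploration succeeds more often.
    Stage 2: attach the rollouts back to the ORIGINAL noisy prompt, so that policy
             optimization (e.g. a GRPO-style update) teaches the model to ignore
             the interference tokens it will see at test time.
    """
    clean_prompt = purify_prompt(noisy_prompt)
    responses = sample_rollouts(clean_prompt, budget)

    rollouts = []
    for resp in responses:
        reward = 1.0 if is_correct(noisy_prompt, resp) else 0.0
        rollouts.append(Rollout(prompt=noisy_prompt, response=resp, reward=reward))
    return rollouts

The point the sketch preserves is the transfer described in the abstract: sampling happens on the purified prompt, while the stored supervision target is the original noisy prompt.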