

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

May 1, 2026
Authors: Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu, Wei Lin, Ran He, Guojun Yin
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting the penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL), which decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
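The core mechanism described above can be sketched as follows. This is a minimal illustration under my own assumptions, not the authors' implementation: I assume the positive subspace is the span of the top right-singular vectors of the positive-token hidden-state matrix, and that the normalized residual magnitude serves as a per-token weight on the negative gradient (small residual means the token lies in the shared semantic subspace and its penalty should be damped). The function name `projection_residual` and the choice of `rank` are hypothetical.

```python
import numpy as np

def projection_residual(pos_hidden, neg_hidden, rank=8):
    """Sketch of SVD-based projection of negative-token hidden states
    onto a low-rank positive subspace, as described in the abstract.

    pos_hidden: (n_pos, d) hidden states of positive-response tokens
    neg_hidden: (n_neg, d) hidden states of negative-response tokens
    Returns per-token normalized residual norms in [0, 1], usable as
    modulation weights on negative gradients.
    """
    # Low-rank orthonormal basis of the positive subspace via SVD.
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    basis = vt[:rank]                      # (rank, d) orthonormal rows

    # Project negative hidden states onto the positive subspace.
    proj = neg_hidden @ basis.T @ basis    # (n_neg, d)
    residual = neg_hidden - proj           # component outside shared semantics

    # Residual magnitude per token, normalized by the token's own norm:
    # near 0 -> token lies in the shared (positive) subspace, so its
    # negative gradient would be damped; near 1 -> token is semantically
    # distinct and keeps its full penalty.
    return np.linalg.norm(residual, axis=1) / (np.linalg.norm(neg_hidden, axis=1) + 1e-8)
```

In this sketch, a negative token whose representation is fully explained by the positive subspace receives a weight near 0, so shared semantics are not suppressed, while tokens orthogonal to that subspace keep a weight near 1.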
May 8, 2026