

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

May 1, 2026
作者: Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu, Wei Lin, Ran He, Guojun Yin
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) enhances the reasoning ability of Large Language Models (LLMs) but usually limits generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting the penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL), which decouples the similar semantic distributions shared by positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses the projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
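The core projection step described above can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction based only on the abstract, not the authors' released code: it builds a low-rank subspace from positive-token hidden states via SVD, projects negative-token hidden states onto it, and uses the normalized residual as a per-token weight for the negative gradient penalty. The function name, the `rank` parameter, and the normalization scheme are all assumptions for illustration.

```python
import numpy as np

def projection_residual_weights(H_pos, H_neg, rank=8):
    """Hypothetical sketch of ResRL's projection-residual step.

    H_pos: (n_pos, d) hidden states of positive-response tokens.
    H_neg: (n_neg, d) hidden states of negative-response tokens.
    Returns per-negative-token weights in [0, 1]: large residual means
    the token is semantically distinct from the positive subspace, so
    its negative-gradient penalty is kept; small residual damps it.
    """
    # Low-rank positive subspace from the top right-singular vectors.
    _, _, Vt = np.linalg.svd(H_pos, full_matrices=False)
    V = Vt[:rank].T                      # (d, rank) orthonormal basis

    # Decompose each negative hidden state into a component shared
    # with the positive subspace and an orthogonal residual.
    proj = H_neg @ V @ V.T               # shared semantic component
    residual = H_neg - proj              # negative-specific component

    # Normalize residual norm by token norm so weights lie in [0, 1].
    res_norm = np.linalg.norm(residual, axis=1)
    tok_norm = np.linalg.norm(H_neg, axis=1) + 1e-8
    return res_norm / tok_norm
```

In training, such weights would multiply each negative token's advantage before the policy-gradient update, so penalties concentrate on tokens that do not overlap with positive-response semantics.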