

Nudging the Boundaries of LLM Reasoning

September 30, 2025
作者: Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu
cs.AI

Abstract

Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are "unsolvable" to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. Consequently, the model's "upper limit" remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard samples cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a chain of thought (CoT) and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts the pass rate (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated, avoiding distributional shift and requiring no external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level, and they are most beneficial when applied only when necessary and after GRPO has converged.
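To make the procedure concrete, here is a minimal Python sketch of the nudging loop described in the abstract. It is an illustration only: the helper callables (generate, extract_answer, self_generate_hint, grpo_update) and the default G=8 are assumptions for this sketch, not the paper's implementation.

```python
# A minimal sketch (not the authors' code) of the NuRL nudging loop around a
# GRPO-style update. The helpers passed in -- generate, extract_answer,
# self_generate_hint, grpo_update -- are hypothetical and stand in for the
# usual sampling, answer-checking, hint-generation, and policy-update steps.

def nurl_step(policy, question, gold_answer, *,
              generate, extract_answer, self_generate_hint, grpo_update, G=8):
    """One NuRL training step on a single question."""
    # 1) Sample G rollouts from the current policy and score them.
    rollouts = [generate(policy, question) for _ in range(G)]
    rewards = [float(extract_answer(r) == gold_answer) for r in rollouts]

    # 2) Hard sample: every rollout failed, so a plain GRPO step would see
    #    identical (zero) rewards and hence produce no gradient. Nudge the
    #    model by injecting a self-generated, abstract hint and re-sampling.
    if sum(rewards) == 0:
        # The hint is produced by the same model, conditioned on the question
        # and its gold answer (via a CoT), so no external teacher is needed
        # and there is no distributional shift.
        hint = self_generate_hint(policy, question, gold_answer)
        nudged = f"{question}\n\nHint: {hint}"
        rollouts = [generate(policy, nudged) for _ in range(G)]
        rewards = [float(extract_answer(r) == gold_answer) for r in rollouts]

    # 3) Standard GRPO-style group-relative update; if the rewards are still
    #    all zero, the sample simply contributes no gradient, as before.
    return grpo_update(policy, rollouts, rewards)
```

In this sketch the hint is only injected when the pass rate is 0%, matching the abstract's finding that hints help most when applied only when necessary.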