Nudging the Boundaries of LLM Reasoning
September 30, 2025
Authors: Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu
cs.AI
Abstract
Current online reinforcement learning (RL) algorithms like GRPO share a key
limitation in LLM reasoning: they cannot learn from problems that are
"unsolvable" to the model. In other words, they can only improve performance on
problems where the model is capable of exploring the correct answer.
Consequently, the model's "upper limit" remains unchanged after RL training,
even though the likelihood of solving easier, solvable problems may increase.
These hard samples cannot contribute to training, as no rollouts yield rewards
and thus no gradients are produced. To unlock learning from these hard samples,
we propose NuRL, a "nudging" method that aims to push the upper bound of LLM
reasoning using self-generated hints, i.e., abstract cues that help reduce the
problem difficulty for the model. Given a question and its gold answer, the
model generates a CoT and then produces a hint containing the core knowledge
needed to solve the problem. During training, we generate G rollouts from the
base policy and use the pass rate to decide whether the hint should be
injected. For hard samples with a 0% pass rate, we inject the hint and
regenerate a new batch of trajectories. This yields two benefits: (1) the hint
boosts pass rates (from 0% to non-zero), thereby introducing training signals
for previously unsolvable samples, and (2) the hints are self-generated,
avoiding distributional shift without relying on external models. NuRL achieves
consistent improvements across 6 benchmarks and 3 models, while remaining
complementary to test-time scaling. Notably, NuRL can raise the model's upper
limit, whereas GRPO leaves pass@1024 unchanged from the base model.
Furthermore, we present a systematic study of what makes an effective hint and
when hints are most useful. Interestingly, the best hints are abstract and
high-level, and are most beneficial when applied only when necessary and after GRPO has
converged.
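
The abstract describes the training-time decision precisely enough to sketch it: sample G rollouts, check the pass rate, and only nudge the model with a self-generated hint when no rollout is rewarded. The following is a minimal, hypothetical Python sketch of that step based solely on the abstract; `policy_rollout`, `reward`, and `generate_hint` are placeholder names for the sampling, answer-checking, and self-hinting components, which the abstract does not name.

```python
# Minimal sketch of NuRL's hint-injection step, reconstructed from the abstract.
# All callables below are hypothetical stand-ins, not APIs from the paper.
from typing import Callable, List

def nurl_collect_rollouts(
    question: str,
    gold_answer: str,
    policy_rollout: Callable[[str, int], List[str]],  # samples G CoT trajectories for a prompt
    reward: Callable[[str, str], float],              # 1.0 if a trajectory reaches the gold answer, else 0.0
    generate_hint: Callable[[str, str], str],         # self-generated abstract hint from (question, gold answer)
    G: int = 8,
) -> List[str]:
    """Return a batch of trajectories that carries a training signal.

    1. Sample G rollouts from the base policy.
    2. If the pass rate is 0% (an "unsolvable" sample), inject a
       self-generated hint into the prompt and resample the batch.
    """
    rollouts = policy_rollout(question, G)
    pass_rate = sum(reward(r, gold_answer) for r in rollouts) / G

    if pass_rate == 0.0:
        # Hard sample: nudge the model with an abstract, high-level hint
        # so that some rollouts earn reward and gradients can be produced.
        hint = generate_hint(question, gold_answer)
        nudged_prompt = f"{question}\n\nHint: {hint}"
        rollouts = policy_rollout(nudged_prompt, G)

    return rollouts
```

Because the hint is produced by the model itself rather than an external teacher, the resampled trajectories stay close to the policy's own distribution, which is the property the abstract highlights for avoiding distributional shift.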