Nudging the Boundaries of LLM Reasoning

Abstract

Current online reinforcement learning (RL) algorithms such as GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are "unsolvable" for the model. In other words, they can only improve performance on problems for which the model can already reach the correct answer through exploration. Consequently, the model's "upper limit" remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. Hard samples cannot contribute to training because no rollout yields a reward, so no gradient is produced. To unlock learning from these hard samples, we propose NuRL, a "nudging" method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that reduce the difficulty of a problem for the model. Given a question and its gold answer, the model generates a chain of thought (CoT) and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint raises the pass rate (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated, which avoids distributional shift and removes any reliance on external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model's upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes a hint effective and when hints are most useful. Interestingly, the best hints are abstract and high-level, and they are most beneficial when applied only where necessary and after GRPO has converged.
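The mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes binary rewards, a GRPO-style group normalization of the form (r - mean) / (std + eps), and a hint that is simply appended to the prompt; the abstract does not specify the injection format. The names group_advantages, rollouts_with_nudge, sample, and is_correct are hypothetical and introduced only for illustration.

```python
import statistics
from typing import Callable, List, Sequence, Tuple


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages (assumed form): (r - mean) / (std + eps).

    When every rollout in the group receives the same reward (for example
    all zeros on an "unsolvable" problem), every advantage is 0, so the
    sample contributes no gradient.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def rollouts_with_nudge(
    question: str,
    hint: str,
    sample: Callable[[str], str],       # hypothetical: prompt -> model answer
    is_correct: Callable[[str], bool],  # hypothetical: answer -> matches gold?
    G: int = 8,
) -> Tuple[List[float], bool]:
    """Sample G rollouts; if the pass rate is 0%, inject the hint and resample."""
    rewards = [1.0 if is_correct(sample(question)) else 0.0 for _ in range(G)]
    pass_rate = sum(rewards) / G
    if pass_rate > 0.0:
        return rewards, False  # solvable without help: train on it as usual

    # Hard sample: regenerate the whole batch with the hint in the prompt.
    # The "Hint:" prompt format is an assumption made for this sketch.
    hinted_prompt = f"{question}\n\nHint: {hint}"
    rewards = [1.0 if is_correct(sample(hinted_prompt)) else 0.0 for _ in range(G)]
    return rewards, True
```

With an all-zero reward group, group_advantages returns all zeros, which is the "no gradient" case the abstract describes. Once the hint lifts at least one rollout to a correct answer, the rewards differ within the group and the advantages become non-zero.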
