Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thereby acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, however, we critically re-examine this assumption by measuring the pass@k metric with large values of k to explore the reasoning capability boundary of models across a wide range of model families and benchmarks. Surprisingly, RL does not in fact elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at small values of k (e.g., k=1), base models can achieve comparable or even higher pass@k scores than their RL counterparts at large values of k. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, so that correct responses are sampled more efficiently, but this also results in a narrower reasoning capability boundary than that of the base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
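
To make the pass@k comparison concrete, here is a minimal sketch. The abstract does not say which estimator was used, so this assumes the standard unbiased estimator common in code-generation evaluation: with n samples per problem of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The function names and the per-problem counts below are hypothetical, chosen only to illustrate how a model that is better at k=1 can still be overtaken at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    # Probability that a random k-subset of the n samples is all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data (not from the paper): correct samples out of n = 256
# on four problems for a base model and an RL-trained model.
N_SAMPLES = 256
base_counts = [128, 8, 2, 1]   # solves every problem, but mostly rarely
rl_counts = [250, 200, 0, 0]   # solves two problems reliably, two never

for k in (1, 16, 256):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-style counts win at k=1 because probability mass is concentrated on problems already solved, while the base-style counts win at k=256 because every problem has at least one correct sample. That mirrors the qualitative pattern the abstract describes, without reproducing any of its measurements.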
