A Survey of Reinforcement Learning for Large Reasoning Models

Abstract

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly on complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. It is therefore timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research that applies RL to the reasoning abilities of LLMs and LRMs, especially since the release of DeepSeek-R1, covering foundational components, core problems, training resources, and downstream applications, in order to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models. GitHub: https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs
