Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Abstract

Reinforcement learning for LLM reasoning has rapidly emerged as a prominent research area, marked by a surge of studies on both algorithmic innovations and practical applications. Despite this progress, several critical challenges remain, including the absence of standardized guidelines for applying RL techniques and a fragmented understanding of their underlying mechanisms. In addition, inconsistent experimental settings, variations in training data, and differences in model initialization have led to conflicting conclusions, obscuring the key characteristics of these techniques and leaving practitioners unsure which ones to select. This paper systematically reviews widely adopted RL techniques through rigorous reproductions and isolated evaluations within a unified open-source framework. We analyze the internal mechanisms, applicable scenarios, and core principles of each technique through fine-grained experiments that span datasets of varying difficulty, model sizes, and architectures. Based on these insights, we present clear guidelines for selecting RL techniques tailored to specific setups and provide a reliable roadmap for practitioners applying RL to LLMs. Finally, we show that a minimalist combination of two techniques can unlock the learning capability of critic-free policies using the vanilla PPO loss. The results demonstrate that this simple combination consistently improves performance, surpassing strategies such as GRPO and DAPO.
