Understanding R1-Zero-Like Training: A Critical Perspective

Abstract

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
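The abstract does not spell out the form of the optimization bias, so the following is only a minimal sketch under stated assumptions. It assumes the usual description of GRPO, in which each response's reward is normalized within its sampled group and each response's token losses are averaged over that response's own length. The function names, the toy rewards, and the `max_len` constant are hypothetical and are not taken from the released code. The sketch illustrates why per-response length averaging could favor long incorrect outputs: the same negative advantage is spread over more tokens, so each token is penalized less.

```python
from statistics import mean, pstdev


def std_normalized_advantages(rewards, eps=1e-8):
    """Group-relative advantages with mean and std normalization (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def centered_advantages(rewards):
    """Group-relative advantages with mean centering only (no std normalization)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def token_weight_length_normalized(advantage, length):
    """Per-token weight when a response's loss is averaged over its own length."""
    return advantage / length


def token_weight_constant(advantage, max_len):
    """Per-token weight when every response is divided by the same constant."""
    return advantage / max_len


if __name__ == "__main__":
    # Toy group: one correct response (reward 1) and three incorrect ones (reward 0).
    rewards = [1.0, 0.0, 0.0, 0.0]
    adv = centered_advantages(rewards)

    # Two incorrect responses with the same advantage but different lengths.
    short_len, long_len, max_len = 100, 1000, 1000
    print("length-normalized, short wrong answer:",
          token_weight_length_normalized(adv[1], short_len))
    print("length-normalized, long wrong answer: ",
          token_weight_length_normalized(adv[2], long_len))
    print("constant-normalized, both wrong answers:",
          token_weight_constant(adv[1], max_len))
```

Under these assumptions, the long incorrect response receives a smaller per-token penalty than the short one when losses are averaged per response, while dividing by a shared constant treats both the same. Whether this matches the exact formulation of Dr. GRPO should be checked against the paper and the linked repository.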
