Understanding R1-Zero-Like Training: A Critical Perspective
March 26, 2025
Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
cs.AI
Abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can
directly enhance the reasoning capabilities of LLMs without supervised
fine-tuning. In this work, we critically examine R1-Zero-like training by
analyzing its two core components: base models and RL. We investigate a wide
range of base models, including DeepSeek-V3-Base, to understand how pretraining
characteristics influence RL performance. Our analysis reveals that
DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models
demonstrate strong reasoning capabilities even without prompt templates,
suggesting potential pretraining biases. Additionally, we identify an
optimization bias in Group Relative Policy Optimization (GRPO), which
artificially increases response length (especially for incorrect outputs)
during training. To address this, we introduce Dr. GRPO, an unbiased
optimization method that improves token efficiency while maintaining reasoning
performance. Leveraging these insights, we present a minimalist R1-Zero recipe
that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a
new state-of-the-art. Our code is available at
https://github.com/sail-sg/understand-r1-zero.
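The abstract attributes the length inflation to an optimization bias in GRPO but does not spell out its form here. The sketch below is a minimal numerical illustration, assuming the standard GRPO aggregation (per-response length normalization 1/|o_i| and division of advantages by the group reward standard deviation) versus an unbiased aggregation in the spirit of Dr. GRPO that drops both terms; the rewards, lengths, and function names are hypothetical, not taken from the paper.

```python
# Hedged illustration (not from the paper's text): toy numbers showing how
# per-response length normalization can under-penalize long incorrect rollouts.
import numpy as np

def per_token_scale_grpo(rewards, lengths):
    # Standard GRPO-style aggregation: advantages are divided by the group
    # reward standard deviation, and each response's token losses are
    # averaged over its own length |o_i|.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return adv / lengths

def per_token_scale_unbiased(rewards, lengths):
    # Unbiased aggregation in the spirit of Dr. GRPO: drop both the std
    # scaling and the per-response length normalization, so every token in
    # the group carries the same advantage weight.
    adv = rewards - rewards.mean()
    return np.broadcast_to(adv, lengths.shape)

rewards = np.array([1.0, 0.0, 0.0])   # one correct, two incorrect rollouts
lengths = np.array([200, 100, 1000])  # token counts per rollout (toy values)

print(per_token_scale_grpo(rewards, lengths))
# -> roughly [0.0071, -0.0071, -0.0007]: the 1000-token incorrect rollout is
#    penalized ~10x less per token than the 100-token one, so lengthening
#    wrong answers is cheap under the biased objective.
print(per_token_scale_unbiased(rewards, lengths))
# -> [0.667, -0.333, -0.333]: incorrect rollouts are penalized equally per
#    token regardless of their length.
```

Under the biased scaling, a negative-advantage (incorrect) response dilutes its per-token penalty as it grows, which is consistent with the abstract's observation that response length increases especially for incorrect outputs.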