Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
March 13, 2025
Authors: Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
cs.AI
Abstract
This paper presents our work on the Light-R1 series, with models, data, and code all released. We first focus on training long-COT models from scratch, specifically starting from models that initially lack long-COT capabilities. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct, achieving superior math performance compared to DeepSeek-R1-Distill-Qwen-32B. Despite being trained exclusively on math data, Light-R1-32B shows strong generalization to other domains. In the subsequent phase of this work, we highlight the significant benefit of the 3k dataset constructed for the second SFT stage in enhancing other models. By fine-tuning DeepSeek-R1-Distilled models with this dataset, we obtain new SOTA models at the 7B and 14B scales, while the 32B model, Light-R1-32B-DS, performs comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying reinforcement learning, specifically GRPO, to long-COT models to further improve reasoning performance. We successfully train our final model, Light-R1-14B-DS, with RL, achieving SOTA performance among 14B-parameter models in math. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses even many 32B models and DeepSeek-R1-Distill-Llama-70B. Its RL training also exhibits the expected behavior of simultaneous increases in response length and reward score. The Light-R1 series of work validates training long-COT models from scratch, showcases the art of SFT data curation, and releases SOTA models trained with RL.
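The GRPO step mentioned in the abstract relies on group-relative advantages: rewards for a group of responses sampled from the same prompt are normalized within the group and used in a clipped policy-gradient objective. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; the group size, reward function, clipping value, and tensor shapes are assumptions for illustration and are not taken from the paper's training setup.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: normalize each response's reward by the
    mean and std of its group (responses sampled from the same prompt)."""
    # rewards: [num_prompts, group_size]
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate loss, with the group-relative advantage
    broadcast over every token of the corresponding response."""
    # logp_new / logp_old: [num_prompts, group_size, seq_len] token log-probs
    # advantages:          [num_prompts, group_size]
    ratio = torch.exp(logp_new - logp_old)
    adv = advantages.unsqueeze(-1)  # broadcast advantage over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # Toy example: 2 prompts, 4 sampled responses each, 8 tokens per response.
    rewards = torch.rand(2, 4)
    logp_old = torch.randn(2, 4, 8)
    logp_new = logp_old + 0.01 * torch.randn(2, 4, 8)
    adv = grpo_advantages(rewards)
    print(grpo_policy_loss(logp_new, logp_old, adv))
```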