DAPO: An Open-Source LLM Reinforcement Learning System at Scale
March 18, 2025
Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang
cs.AI
Abstract
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce the four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
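The abstract names DAPO's two defining ideas, decoupled clipping and dynamic sampling, without spelling them out. As a rough illustration of what these terms typically mean in a PPO-style objective, the sketch below shows a surrogate loss with separate lower and upper clip ranges and a filter that discards prompt groups whose sampled responses all receive the same reward. Function names, default values, and the exact loss form are assumptions made for illustration, not the paper's implementation or the verl API.

```python
# Minimal sketch (assumed names and defaults, not DAPO's exact formulation):
# a PPO-style surrogate with decoupled (asymmetric) clip bounds, plus a
# dynamic-sampling filter that keeps only prompt groups with reward variance.
import torch


def decoupled_clip_loss(log_probs, old_log_probs, advantages,
                        eps_low=0.2, eps_high=0.28, mask=None):
    """Token-level surrogate loss with separate lower/upper clip ranges."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    loss = -torch.minimum(unclipped, clipped)
    if mask is not None:
        # Average only over valid (non-padding) tokens.
        return (loss * mask).sum() / mask.sum().clamp(min=1)
    return loss.mean()


def dynamic_sampling_filter(prompt_groups):
    """Drop prompt groups whose responses all earn identical rewards, since
    such groups contribute zero advantage signal to the policy update."""
    kept = []
    for group in prompt_groups:
        rewards = [r["reward"] for r in group["responses"]]
        if max(rewards) > min(rewards):
            kept.append(group)
    return kept
```

The asymmetric bounds let the ratio move further upward than downward before being clipped, and the filter ensures every batch element carries a usable learning signal; both are sketched here only as plausible readings of the algorithm's name, with details deferred to the paper and released code.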