
DAPO: An Open-Source LLM Reinforcement Learning System at Scale

March 18, 2025
Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang
cs.AI

Abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique for eliciting complex reasoning. However, key technical details of state-of-the-art reasoning LLMs remain concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce the four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, built on the verl framework, along with a carefully curated and processed dataset. These open-source system components enhance reproducibility and support future research in large-scale LLM RL.
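
For readers skimming the listing, a minimal sketch of the objective the algorithm's name describes may help. This is our paraphrase of the paper's formulation, not quoted from this page; the symbols ε_low, ε_high, r_{i,t}, Â_{i,t}, and G are not defined in the abstract. DAPO optimizes a PPO-style clipped surrogate whose lower and upper clipping bounds are decoupled, while dynamic sampling filters out prompts whose G sampled outputs are all correct or all incorrect, so every retained group carries a non-zero advantage signal:

$$
\mathcal{J}_{\mathrm{DAPO}}(\theta) = \mathbb{E}\left[ \frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\Big( r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\big( r_{i,t}(\theta),\, 1 - \varepsilon_{\mathrm{low}},\, 1 + \varepsilon_{\mathrm{high}} \big)\, \hat{A}_{i,t} \Big) \right]
\quad \text{s.t.} \quad 0 < \big|\{\, o_i : o_i \text{ is correct} \,\}\big| < G,
$$

where r_{i,t}(θ) is the token-level importance ratio between the current and behavior policies and Â_{i,t} is the group-normalized advantage. Averaging over all generated tokens, rather than over samples, reflects the paper's token-level policy-gradient loss.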

