DAPO: An Open-Source LLM Reinforcement Learning System at Scale
March 18, 2025
Authors: Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, Wei-Ying Ma, Ya-Qin Zhang, Lin Yan, Mu Qiao, Yonghui Wu, Mingxuan Wang
cs.AI
Abstract
Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce the four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
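The abstract names DAPO's two defining ideas, decoupled clipping and dynamic sampling, without spelling them out. As a rough illustration of what these terms typically mean in a PPO-style objective, the sketch below shows a surrogate loss with separate lower and upper clip ranges and a filter that discards prompt groups whose sampled responses all receive the same reward. Function names, default values, and the exact loss form are assumptions made for illustration, not the paper's implementation or the verl API.

```python
# Minimal sketch (assumed names and defaults, not DAPO's exact formulation):
# a PPO-style surrogate with decoupled (asymmetric) clip bounds, plus a
# dynamic-sampling filter that keeps only prompt groups with reward variance.
import torch


def decoupled_clip_loss(log_probs, old_log_probs, advantages,
                        eps_low=0.2, eps_high=0.28, mask=None):
    """Token-level surrogate loss with separate lower/upper clip ranges."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    loss = -torch.minimum(unclipped, clipped)
    if mask is not None:
        # Average only over valid (non-padding) tokens.
        return (loss * mask).sum() / mask.sum().clamp(min=1)
    return loss.mean()


def dynamic_sampling_filter(prompt_groups):
    """Drop prompt groups whose responses all earn identical rewards, since
    such groups contribute zero advantage signal to the policy update."""
    kept = []
    for group in prompt_groups:
        rewards = [r["reward"] for r in group["responses"]]
        if max(rewards) > min(rewards):
            kept.append(group)
    return kept
```

The asymmetric bounds let the ratio move further upward than downward before being clipped, and the filter ensures every batch element carries a usable learning signal; both are sketched here only as plausible readings of the algorithm's name, with details deferred to the paper and released code.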