DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Abstract

Inference scaling gives LLMs unprecedented reasoning ability, with reinforcement learning (RL) as the core technique for eliciting complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (for example, in the OpenAI o1 blog post and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results. We propose the **Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)** algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the **verl** framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
