INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning
May 12, 2025
Authors: Prime Intellect Team, Sami Jaghouar, Justus Mattern, Jack Min Ong, Jannik Straube, Manveer Basra, Aaron Pazdera, Kushal Thaman, Matthew Di Ferrante, Felix Gabriel, Fares Obeid, Kemal Erdem, Michael Keiblinger, Johannes Hagemann
cs.AI
Abstract
We introduce INTELLECT-2, the first globally distributed reinforcement
learning (RL) training run of a 32-billion-parameter language model. Unlike
traditional centralized training efforts, INTELLECT-2 trains a reasoning model
using fully asynchronous RL across a dynamic, heterogeneous swarm of
permissionless compute contributors.
To enable a training run with this unique infrastructure, we built various
components from scratch: we introduce PRIME-RL, our training framework
purpose-built for distributed asynchronous reinforcement learning, built on top
of novel components such as TOPLOC, which verifies rollouts from untrusted
inference workers, and SHARDCAST, which efficiently broadcasts policy weights
from training nodes to inference workers.
Beyond infrastructure components, we propose modifications to the standard
GRPO training recipe and data filtering techniques that were crucial for
achieving training stability and ensuring that our model successfully learned
its training objective, thus improving upon QwQ-32B, the state-of-the-art
reasoning model in the 32B parameter range.
We open-source INTELLECT-2 along with all of our code and data, hoping to
encourage and enable more open research in the field of decentralized training.
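
For readers unfamiliar with the GRPO recipe the abstract refers to, the sketch below illustrates the group-relative advantage that standard GRPO computes: each rollout's reward is normalized against the mean and standard deviation of the rewards for the same prompt. It does not show the modifications proposed for INTELLECT-2, which the abstract does not detail. The function name, the `eps` term, the use of the population standard deviation, and the example rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    `rewards` holds the scalar rewards of all rollouts sampled for one
    prompt. `eps` (an assumed value) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one prompt, scored 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that score above their group's average receive a positive advantage and those below receive a negative one. A group in which all rollouts score the same yields zero advantage everywhere and therefore contributes no learning signal.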