INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

May 12, 2025
Authors: Prime Intellect Team, Sami Jaghouar, Justus Mattern, Jack Min Ong, Jannik Straube, Manveer Basra, Aaron Pazdera, Kushal Thaman, Matthew Di Ferrante, Felix Gabriel, Fares Obeid, Kemal Erdem, Michael Keiblinger, Johannes Hagemann
cs.AI

Abstract

We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32-billion-parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run on this unique infrastructure, we built several components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, which builds on novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial for achieving training stability and ensuring that our model successfully learned its training objective, thus improving upon QwQ-32B, the state-of-the-art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training.
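For readers unfamiliar with GRPO, the following is a minimal sketch of the group-relative advantage computation used in the standard recipe that the abstract says was modified. It is not the INTELLECT-2 variant, whose changes are not described here. The function name, the `eps` constant, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages (illustrative, not the paper's modified recipe).

    Each reward is normalized against the mean and standard deviation of the
    rollouts sampled for the same prompt, so no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    # eps (hypothetical default) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt, scored 1.0 (correct) or 0.0 (incorrect)
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason data filtering can matter in this kind of setup.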
