INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

Abstract

We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run on this unique infrastructure, we built various components from scratch: we introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning, which is built on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques that were crucial for achieving training stability and ensuring that our model successfully learned its training objective, thus improving upon QwQ-32B, the state-of-the-art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training.
