INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning

Abstract

We introduce INTELLECT-2, the first globally distributed reinforcement learning (RL) training run of a 32 billion parameter language model. Unlike traditional centralized training efforts, INTELLECT-2 trains a reasoning model using fully asynchronous RL across a dynamic, heterogeneous swarm of permissionless compute contributors. To enable a training run on this unusual infrastructure, we built several components from scratch. We introduce PRIME-RL, our training framework purpose-built for distributed asynchronous reinforcement learning. It is built on top of novel components such as TOPLOC, which verifies rollouts from untrusted inference workers, and SHARDCAST, which efficiently broadcasts policy weights from training nodes to inference workers. Beyond infrastructure components, we propose modifications to the standard GRPO training recipe and data filtering techniques. These were crucial for achieving training stability and for ensuring that our model successfully learned its training objective, thus improving upon QwQ-32B, the state-of-the-art reasoning model in the 32B parameter range. We open-source INTELLECT-2 along with all of our code and data, hoping to encourage and enable more open research in the field of decentralized training.

