Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing
September 10, 2025
Authors: Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
cs.AI
Abstract
Post-training language models (LMs) with reinforcement learning (RL) can
enhance their complex reasoning capabilities without supervised fine-tuning, as
demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs
requires significant parallelization to scale up inference, which introduces
non-trivial technical challenges (e.g., latency, memory, and reliability)
alongside ever-growing financial costs. We present Swarm sAmpling Policy
Optimization (SAPO), a fully decentralized and asynchronous RL post-training
algorithm. SAPO is designed for decentralized networks of heterogeneous compute
nodes, where each node manages its own policy model(s) while "sharing" rollouts
with others in the network; no explicit assumptions about latency, model
homogeneity, or hardware are required, and nodes can operate in silos if desired.
As a result, the algorithm avoids common bottlenecks in scaling RL
post-training while also allowing (and even encouraging) new possibilities. By
sampling rollouts "shared" across the network, it enables "Aha moments" to
propagate, thereby bootstrapping the learning process. In this paper we show
SAPO achieved cumulative reward gains of up to 94% in controlled experiments.
We also share insights from tests on a network with thousands of nodes
contributed by Gensyn community members running the algorithm on diverse
hardware and models during an open-source demo.
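To make the sharing mechanism concrete, below is a minimal sketch of one SAPO-style round, assuming a simplified setting: each node generates rollouts with its own policy, shares them with peers, and mixes local with received rollouts into its training batch before updating independently. All names here (SwarmNode, Rollout, swarm_round, policy_update) are illustrative assumptions, not the authors' implementation; the actual LM sampling and RL update are abstracted into placeholders.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Rollout:
    """A single trajectory: prompt, completion, and scalar reward (hypothetical schema)."""
    prompt: str
    completion: str
    reward: float

@dataclass
class SwarmNode:
    """One swarm node: owns its local policy and a pool of rollouts received from peers."""
    node_id: int
    local_rollouts: list = field(default_factory=list)
    received_rollouts: list = field(default_factory=list)

    def generate_rollouts(self, prompts, n_per_prompt=4):
        # Placeholder: sample completions from this node's own policy model and score them.
        rollouts = [
            Rollout(p, f"<completion from node {self.node_id}>", random.random())
            for p in prompts
            for _ in range(n_per_prompt)
        ]
        self.local_rollouts = rollouts
        return rollouts

    def receive(self, rollouts):
        # Rollouts "shared" by other nodes; no synchronization or model homogeneity assumed.
        self.received_rollouts.extend(rollouts)

    def build_batch(self, n_local, n_external):
        # Mix locally generated and externally shared rollouts into one training batch.
        batch = random.sample(self.local_rollouts, min(n_local, len(self.local_rollouts)))
        batch += random.sample(self.received_rollouts, min(n_external, len(self.received_rollouts)))
        return batch

    def policy_update(self, batch):
        # Placeholder for the node's own RL update (e.g., a PPO/GRPO-style step) on the batch.
        return sum(r.reward for r in batch) / max(len(batch), 1)


def swarm_round(nodes, prompts):
    """One round: every node generates and shares rollouts, then updates independently."""
    for node in nodes:
        shared = node.generate_rollouts(prompts)
        for peer in nodes:
            if peer is not node:
                peer.receive(shared)
    return [node.policy_update(node.build_batch(n_local=4, n_external=4)) for node in nodes]


if __name__ == "__main__":
    nodes = [SwarmNode(i) for i in range(3)]
    print(swarm_round(nodes, ["What is 2+2?", "Prove sqrt(2) is irrational."]))
```

In this sketch the synchronous loop stands in for what the paper describes as asynchronous operation: because each node only consumes whatever shared rollouts it has received so far, nodes with slow or unreliable hardware can lag or drop out without blocking the rest of the swarm.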