Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

September 10, 2025
Authors: Jeffrey Amico, Gabriel Passamani Andrade, John Donaghy, Ben Fielding, Tristin Forbus, Harry Grieve, Semih Kara, Jari Kolehmainen, Yihua Lou, Christopher Nies, Edward Phillip Flores Nuño, Diogo Ortega, Shikhar Rastogi, Austin Virts, Matthew J. Wright
cs.AI

Abstract

Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero. However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members running the algorithm on diverse hardware and models during an open-source demo.
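To make the rollout-sharing idea concrete, the toy sketch below shows one plausible reading of the abstract: each node keeps its own policy, publishes locally generated rollouts to a shared pool, and builds each training batch from a mix of its own and others' rollouts. All names here (`Node`, `RolloutPool`, `local_fraction`, etc.) are hypothetical illustrations, not the authors' implementation, and the policy update and reward scoring are left as placeholders.

```python
# Illustrative sketch of SAPO-style rollout sharing (not the paper's code).
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    prompt: str
    completion: str
    reward: float


@dataclass
class RolloutPool:
    """Shared store that every node can publish to and sample from."""
    rollouts: list = field(default_factory=list)

    def publish(self, batch):
        self.rollouts.extend(batch)

    def sample(self, k):
        return random.sample(self.rollouts, min(k, len(self.rollouts)))


class Node:
    """One compute node with its own policy; hardware and model may differ per node."""

    def __init__(self, name, pool):
        self.name, self.pool = name, pool

    def generate_rollouts(self, n):
        # Placeholder for local inference plus reward scoring.
        return [Rollout(f"q{i}", f"a{i}", random.random()) for i in range(n)]

    def train_step(self, batch_size=8, local_fraction=0.5):
        local = self.generate_rollouts(int(batch_size * local_fraction))
        self.pool.publish(local)                             # "share" local rollouts
        shared = self.pool.sample(batch_size - len(local))   # sample others' rollouts
        batch = local + shared
        # Placeholder for the actual policy-gradient update on `batch`.
        return sum(r.reward for r in batch) / max(len(batch), 1)


if __name__ == "__main__":
    pool = RolloutPool()
    nodes = [Node(f"node{i}", pool) for i in range(4)]
    for step in range(3):
        for node in nodes:
            avg_reward = node.train_step()
            print(step, node.name, round(avg_reward, 3))
```

Because each node only publishes to and samples from the pool, nothing in this sketch requires the nodes to share hardware, model architecture, or a synchronization schedule, which is the property the abstract emphasizes.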